Skip to main content

Full text of "Reasoning about Independence in Probabilistic Models of Relational Data"

See other formats


Journal of Machine Learning Research XX (2013) 1-46 



Submitted 2/13; Published XX/XX 



Reasoning about Independence 

in Probabilistic Models of Relational Data 

Marc Maier maier@cs.umass.edu 

Katerina Marazopoulou kmarazo@cs.umass.edu 

David Jensen jensen@cs.umass.edu 
School of Computer Science 
University of Massachusetts Amherst 
Amherst, MA 01003, USA 



Editor: 



Abstract 

Bayesian networks leverage conditional independence to compactly encode joint probability 
distributions. Many learning algorithms exploit the constraints implied by observed condi- 
tional independencies to learn the structure of Bayesian networks. The rules of rf-separation 
provide a theoretical and algorithmic framework for deriving conditional independence facts 
from model structure. However, this theory only applies to Bayesian networks. Many real- 
world systems, such as social or economic systems, are characterized by interacting hetero- 
geneous entities and probabilistic dependencies that cross the boundaries of entities. Con- 
sequently, researchers have developed extensions to Bayesian networks that can represent 
these relational dependencies. We show that the theory of rf-separation inaccurately infers 
conditional independence when applied directly to the structure of probabilistic models 
of relational data. We introduce relational d-separation, a theory for deriving conditional 
independence facts from relational models. We provide a new representation, the abstract 
ground graph, that enables a sound, complete, and computationally efficient method for 
answering rf-separation queries about relational models, and we present empirical results 
that demonstrate effectiveness. 

Keywords: relational models, d-separation, conditional independence, lifted representa- 
tions, directed graphical models 



1. Introduction 

Bayesian networks are a widely used class of graphical models that are capable of compactly 
representing a joint probability distribution over a set of variables. The joint distribution 
can be factorized into a product of much smaller conditional distributions by assuming 
that variables are independent of their non-descendants given their parents (the Markov 
condition). The Markov condition ties the structure of the model to the set of conditional 
independencies that hold over all probability distributions the model can represent. Accu- 
rate reasoning about such conditional independence facts is the basis for constraint-based 
algorithms, such as PC, FCI, and MMHC, that are commonly used to learn the structure of 



Bayesian networks (Spirtes et al. , 2000 Tsamardinos et al. , 2006). Under a small number 



©2013 Marc Maier and Katerina Marazopoulou and David Jensen. 



Maier, Marazopoulou, and Jensen 



of assumptions and with knowledge of the conditional independencies, these algorithms can 



identify causal structure ( Pearl , 2000 Spirtes et al. , 2000 ) . 



Deriving the full set of conditional independencies implied by the Markov condition is 
complex, requiring manipulation of the joint distribution and various probability axioms. 
Fortunately, the exact same set of conditional independencies entailed by the Markov con- 
dition are also entailed by d-separation, a set of graphical rules that algorithmically derive 
conditional independence facts directly from the model structure. That is, the Markov 
condition and d-separation are equivalent approaches for producing conditional indepen- 
dence from Bayesian networks ( Neapolitan! 2004). When interpreting a Bayesian network 
causally, the causal Markov condition (variables are independent of their non-effects given 
their direct causes) and d-separation have been shown to provide the correct connection 
between causal structure and conditional independence (Scheines, 1997). 

Bayesian networks assume that data instances are independent and identically dis- 
tributed, but many real- world systems are characterized by interacting heterogeneous enti- 
ties. For example, social network data consist of individuals, groups, and their relationships; 
citation data involve researchers collaborating and authoring scholarly papers that cite prior 
work; and sports data comprise players, coaches, teams, referees, and their competitive 
interactions. Over the past 15 years, researchers in statistics and computer science have 
devised more expressive classes of directed graphical models, such as probabilistic relational 
models (PRMs), which remove the assumptions of independent and identically distributed 
instances to more accurately describe these types of domains (Getoor and Taskar, 2007). 
Relational models generalize models that incorporate interference, spillover effects, or viola- 



tions of the stable unit treatment value assumption (SUTVA) (Hudgens and Halloran, 2008 



Tchetgen and VanderWeele 2012) and multilevel or hierarchical models (Gelman and Hill 



2007). Many practical applications have also benefited from learning and reasoning with 



relational models. Examples include analysis of gene regulatory interactions (Segal et al. 



2001'), scholarly citations (Taskar et al. , 2001), and biological cellular networks (Friedman 
2004j). 



Relational models also assume the Markov condition for connecting the underlying joint 
distribution with the structure of the model. Just like Bayesian networks, knowledge of 
conditional independencies that hold over relational data is important for relational models. 
Relational data typically encompass a much larger space of variables; thus, the ability to 
factorize the joint distribution into a set of smaller conditional distributions is crucial for 
representation and inference. Furthermore, constraint-based algorithms for learning the 
structure of relational models, such as Relational PC (Maier et al. 2010), are more efficient 
than searching the entire model space, but they rely on conditional independence facts to 
discover model structure. 

In this paper, we show that d-separation does not correctly produce conditional inde- 
pendence facts when applied directly to relational models. We introduce an alternative 
representation, the abstract ground graph, that enables an algorithm for deriving condi- 
tional independence facts from relational models. We show that this algorithm is sound, 
complete, and computationally efficient, and we provide an empirical demonstration of the 
effectiveness of our approach across synthetic causal structures of relational domains. 

In Section [2| we present an example that highlights why standard d-separation is in- 
adequate for relational models. Section [s] offers background on propositional (Bayesian 



2 



Independence in Models of Relational Data 



network) models, and Section |4] formalizes the fundamental concepts of relational data and 
models. In Section [5| we define relational d-separation and abstract ground graphs, and we 
present our soundness and completeness results. In Section [6j we evaluate how frequently 
d-separation on relational models is equivalent to d-separation on abstract ground graphs, 
and Section [7] contains additional experimental results. Finally, Section [8] discusses the 
implications that relational d-separation has for representation and learning and provides 
future directions for this work. 

1.1 Why d-separation is Useful 

A Bayesian network, as a model of a joint probability distribution, enables a wide array of 
useful tasks by supporting inference over an entire system of variables. They have been suc- 
cessfully applied to many domains, ranging from bioinformatics and medicine to computer 
vision and information retrieval. Naively specifying a joint distribution by hand requires 
an exponential number of states; however, Bayesian networks represent joint probability 
distributions that can be factorized into a product of conditional probability distributions. 
Bayesian networks leverage the Markov condition to represent conditional independencies 
in order to compactly specify a joint probability distribution. 

Alternative to the Markov condition, but equivalent in its implications, d-separation 
provides an algorithmic framework to derive the conditional independencies encoded by 
the model. These conditional independence facts are guaranteed to hold in every faithful 
joint distribution the model represents and, consequently, any data instance sampled from 
those distributions. The semantics of holding across all distributions is the main reason 
why d-separation is a useful theory. 

Causal discovery, the task of learning generative models of observational data, superfi- 
cially appears to be a futile endeavor. Yet learning and reasoning about the causal structure 
underlying real domains is a primary goal for many researchers. Fortunately, d-separation 
offers a connection between causal structure and conditional independence. The theory of 
d-separation can be leveraged to constrain the hypothesis space by eliminating models that 
are inconsistent with observed conditional independence facts. While many distributions 
do not lead to uniquely identifiable models, this approach (under simple assumptions) fre- 
quently discovers useful causal knowledge for domains that can be represented as a Bayesian 
network. 

As described above, relational models more closely represent the real- world domains 
that many social scientists and other researchers investigate. These models can represent 
systems of interacting heterogeneous entities and probabilistic dependencies that cross entity 
boundaries. In order to successfully learn causal models of observational relational data, we 
need a similar theory for deriving conditional independence from relational models. In this 
paper, we formalize the theory of relational d-separation, which will provide a theoretical 
framework for algorithms that learn causal models of relational domains. 

2. Example 

Consider a corporate analyst who was hired to identify which products and employees are 
effective and productive for some organization. If the company is structured as a pure 
project-based organization, the analyst may collect data as described by the relational 



3 



Maier, Marazopoulou, and Jensen 



EMPLOYEE 




C!! Z Su ccess 

PRODUCT 




<^ Budget 



Revenue 
BUSINESS-UNIT 



(a) Example relational schema for an organization consisting of employees working on products, which are 
funded by specific business units within a corporation. 



^ Salary^ 

C 3^omp e tence^^ 
Quinn 



C;;^ Salary ^ 


DEVELOPS 


/ 


^jjCompetenceJ^ 
Paul 




^Success^ 
Case 




^^uccess^ 
Adapter 





^^ Com pet ence] ^ 
Roger 



r 



Budget 



Devices 



^l^ompetence]^ 
Thomas 



C^^I^SLJCcess^^ 




^^Success^ 


Laptop 




Tablet 




(b) Example fragment of a relational skeleton. Roger and Sally are employees, both of whom develop the 
Laptop product, but only Sally works on product Tablet. Both products Laptop and Tablet are funded by 
business unit Devices. 

Figure 1: An example relational schema and skeleton for the organization domain. 



schema in Figure 1(a), The schema denotes that employees can collaborate and work on 
multiple products, each of which is funded by a specific business unit. The analyst has also 
obtained variables on each entity — salary and competence of employees, the success of each 
product, and the budget and revenue of business units. In this example, the organization 
consists of five employees, five products, and two business units, which are shown in the 



relational skeleton in Figure 1(b) 



The analyst may believe that the organization operates under the model depicted in 



Figure 2(a) For example, the competence of an employee affects the success of products 
they develop, and the revenue of a business unit is infiuenced by the success of products 
that it funds. The analyst then needs to verify the model structure in order to accurately 
advise executive decisions, such as determining which business units should have increased 
funding, which employees are being fairly or unfairly compensated, etc. Perhaps the analyst 
has experience in graphical models and decides to check that the conditional independencies 
encoded by the model are reflected in the data, assuming the faithfulness condition (see 
Section [s]). The analyst then naively applies d-separation to the model structure in an 
attempt to derive these conditional independencies. However, applying (i-separation directly 
to relational models does not correctly derive the set of conditional independencies. In other 
words, d-separation applied directly to relational models is not equivalent to the Markov 
condition. 



4 



Independence in Models of Relational Data 




[EMPLOYEE, DEVELOPS, PRODUCT, FUNDS, BUSINESS-UNITl.Budget [EMPLOYEEJ.Salary 

[EMPLOYEE].Competence (EMPLOYEE].Salary 

[PRODUCT, DEVELOPS, EMPLOYEEj.Competence [PRODUCTj.Success 

[BUSINESS-UNIT FUNDS, PRODUCTj.Success [BUSINESS-UNITJ.Revenue 

[BUSINESS-UNIT].Revenue [BUSINESS-UNITj.Budget 



(a) Example relational model. Competence of employees cause the success of products they develop, which 
in turn influences the revenue received by the business unit funding the product. Additional dependencies 
involve the budget of business units and employee salaries. The dependencies are specified by relational 
paths, listed below the graphical model. 




(b) Example fragment of a ground graph. The success of product Laptop is influenced by the competence 
of both Roger and Sally. The revenue of business unit Devices is caused by the success of all its funded 
products — Laptop, Tablet, and Smartphone. 



Figure 2: An example relational model and ground graph for the organization domain. 



Naively applying d-separation to the model in Figure 2(a)| suggests that employee com- 
petence is conditionally independent of the revenue of business units given the success of 
products. To see why this approach is flawed, we must consider the ground graph. A nec- 
essary precondition for inference is to apply a model to a data instantiation, which yields 
a ground graph to which d-separation can be applied. For a Bayesian network, a ground 
graph consists of replicates of the model structure for each data instance. In contrast, a 
relational model defines a template for how dependencies apply to a data instantiation, re- 
sulting in a ground graph with varying structure (see Section [1] for more details on ground 
graphs) . 

Figure 2(b) [ shows the ground graph for the relational model in Figure 2(a) applied to the 
relational skeleton in Figure 1(b) This ground graph illustrates that simply conditioning on 
product success can activate a path through the competence of other employees who develop 
the same products — we call this a relational d-connecting path^ Checking d-separation on 



1. The indirect effect attributed to a relational rf-connecting path is often referred to as interference, a 
spillover effect, or a violation of the stable unit treatment value assumption (SUTVA) because the 
treatment of one instance (employee competence) affects the outcome of another (the revenue of another 
employee's business unit). 



5 



Maier, Marazopoulou, and Jensen 



the ground graph indicates that to (i-separate employee competence from business unit 
revenue, we must not only condition on the success of developed products, but also on the 
competence of other employees who work on those products (e.g., Roger. Competence _LL 
Devices. -Refenue | {Laptop. Success, Sally.Competence}). 

This example also demonstrates that the Markov condition can be violated when di- 
rectly applied to the structure of a relational model. In this case, the Markov condition 
applied to the model in Figure 2(a) implies that P {Competence, Revenue \ Success) = 
P{Competence \ Success) P{Revenue \ Success), that revenue is independent of its non- 
descendants (competence) given its parents (success). However, the ground graph shows 
the opposite, for example, P (Rogei. Competence, Devices. Revenue \ Laptop. Success) / 
P (Roger. Competence [Laptop. Success) P(Devices. -Reuenue | Laptop. Success). This is not 
surprising since the conditional independencies entailed by (i-separation are known to be 
equivalent to those entailed by the Markov condition (Neapolitan, 2004). In fact, this is not 
the only conditional independence fact for which d-separation produces an incorrect result 
for this example model. Over 80% of the pairs of relational variables cannot be described 
by direct inspection of the model, and of those that can (such as the above example), 91% 
yield an incorrect conclusion. This is a single data point of a larger empirical evaluation 
presented in Section [6j 

It might appear that, since the standard rules of d-separation apply to Bayesian net- 
works and the ground graphs of relational models are also Bayesian networks, that ap- 
plying d-separation to relational models is a non-issue. However, applying d-separation 
to a single ground graph may result in potentially unbounded runtime if the instantiation 
is large — especially given that relational databases can be arbitrarily large. Further, and 
more importantly, the semantics of d-separation require that conditional independencies 
hold across all possible model instantiations. Although d-separation can apply directly to 
a ground graph, these semantics prohibits reasoning about a single ground graph. 

The conditional independence facts derived from (/-separation hold for all faithful dis- 
tributions represented a Bayesian network. Therefore, the implications of relational d- 
separation should analogously hold for all faithful distributions of variables for the space 
of all possible ground graphs. It is simple to show that ^-separation holds for any ground 
graph of a Bayesian network — every ground graph consists of a set of disconnected sub- 
graphs, each of which has a structure that is identical to that of the model. However, 
relational models are templates, and the resulting ground graphs vary with the relational 
structure of the underlying data (e.g., different products are developed by varying numbers 
of employees). As a result, relational ^-separation queries must be answered without re- 
spect to ground graphs. Additionally, the example illustrates how relational dependencies 
can exhibit d-connecting paths that are only manifest in ground graphs, not the model 
representation. In Section [5j we describe a new representation, the abstract ground graph, 
that can be used to reason about d-separation for relational models. 



3. Background on Propositional Data and Models 

A common assumption in classical statistics, machine learning, and causal discovery is that 
data instances are independent and identically distributed (IID). The first condition assumes 
that the variables on any given data instance are marginally independent of the variables 



6 



Independence in Models of Relational Data 



of any other data instances. The second condition assumes that every data instance is 
drawn from the same underlying joint probability distribution. IID data (also referred to 
as propositional datgj^ are effectively represented as a single table, where rows correspond 
to independent instances and columns are attributes of those instances. 

The directed acyclic graph (DAG), or Bayesian network, is a widely used probabilistic 
graphical model of propositional data (Pearl, 1988). A DAG, G = (V, E), consists of a 



set of vertices V corresponding to random variables in the data and a set of edges E C 
V X V encoding the conditional independencies expected to hold among the variables. Each 
variable y G V is described by a conditional probability distribution P{V \ parents{V)), 
where parents{V) C V \ {V} is the set of parent variables for V. 

If the joint probability distribution -P(V) satisfies the Markov condition for G, then 
P(V) can be compactly factorized according to the conditional distributions, P(V) = 
Y\ve'V I po^ents(y)). The Markov condition states that every variable V G V is 
conditionally independent of its non-descendants given its parents, where the descendants 
of V are all variables reachable by a directed path from V. Deriving the set of conditional 
independencies from G based on the Markov condition is cumbersome, requiring complex 
combinations of probability axioms. Fortunately, d-separation, a set of graphical criteria, 
provides the foundation for algorithmic derivation of all conditional independencies in G 



Markov condition (Neapolitan, 2004) 



(Geiger et al. , 1990) and entails the exact same set of conditional independencies as the 



In the following definition, a path is a sequence of vertices following edges in either 
direction. We say that a variable y is a collider on a path p if the two arrowheads point at 
each other (collide) at V; otherwise, y is a non-collider on p. 

Definition 3.1 (d-separation) Let X, Y, and Z be disjoint sets of variables in directed 
acyclic graph G. A path from some X G X to some 1" G Y is d-connected by Z if and only 
if every collider W on the path, or a descendant of W, is a member of Z, and there are no 
non-colliders in Z. Then, say that X and Y are d-separated by Z if and only if there are 
no (i-connecting paths between X and Y given Z. 



Figure 3(a) depicts the graphical patterns found along paths that lead to d-separation 
or d-connection based on Definition 3.1, and Figure [3(b) [ provides example d-separated and 
c?-connected paths. At a first glance, identifying conditional independence facts from the 
rules of d-separation appears computationally intensive, testing a potentially exponential 
number of paths. However, Geiger et al. (1990) provide a linear-time algorithm based on 



breadth-first search and reachability on G. 

Under a few assumptions, Bayesian networks can be interpreted causally, with edges 
corresponding to direct causal dependencies. If X —t- y is an edge in the causal model 
G, then manipulating or changing the value of X will alter the conditional distribution of 
Y — P{Y I do{X)) using Pearl's do-calculus notation for interventions (Pearl, 2000). The 



causal interpretation of G assumes the causal Markov condition, which is identical to the 
Markov condition, replacing parents with direct causes and non-descendants with non- 
effects. In order for the causal Markov condition to hold, the variables V must also be 

2. IID data are typically referred to as propositional because the data can be equivalently expressed under 
propositional logic. 



7 



Maier, Marazopoulou, and Jensen 



(/-separating path elements 
(exists one on path) 

Q I ^ I 



(/-connecting path elements 
(exists all on path) 




(xW>-® 



® 




(a) Graphical patterns of d-separating and ^-connecting path elements among disjoint sets of variables X 
and Y given Z. Paths for which there exists a non-collider in Z or a collider not in Z are d-separating. 
Paths for which all non-colliders are not in Z and all colliders (or a descendant of colliders) are in Z are 
rf-connecting. 

(/-separated paths (/-connected paths 

(b) Several example d-separated and d-connected paths that illustrates the composition of path elements. 

Figure 3: Patterns of (^-separating and d-connecting path elements and example d- 
separating and (/-connecting paths. 



causally sufficient — there can be no latent common causes for any pair of variables in V. 
The causal Markov condition is also equivalent to (/-separation; therefore, both provide the 
connection between causal structures and probability distributions. 

The conditional independencies entailed by both the causal Markov condition and d- 
separation hold over all faithful distributions that G can represent. A distribution P is 
faithful to G if and only if all conditional independencies in P that are entailed by the 
causal Markov condition on G. Additionally, if P is assumed to be faithful to G, then there 
are algorithms that can learn the Markov, or likelihood, equivalent set of causal models. 
That is, these algorithms correctly identify the causal dependencies in G that are consistent 
with observed conditional independence facts across V. These algorithms typically exploit 



8 



Independence in Models of Relational Data 



the rules of d-separation and the assumption of model acyclicity to determine the direction 



of causality (Spirtes et al. , 2000) 



4. Concepts of Relational Data and Models 

Propositional representations describe domains with a single entity type, but many real- 
world systems involve multiple types of interacting entities with probabilistic dependencies 
that cross the boundaries of entities. The model in Figure 2(a) consists of three such de- 



pendencies (e.g., the competence of employees affect the success of products they develop). 
Many researchers have focused on representing domains characterized by these relational 
properties. In this section, we formally define the concepts of relational data and mod- 
els, providing the basis for the theoretical framework for relational d-separation. We also 
show that the relational representation is strictly more expressive than the propositional 
representation used in Bayesian network modeling. 

A relational schema is a top-level description of what data exist in a particular domain. 



Specifically (adapted from Heckerman et al. , 2007): 



Definition 4.1 (Relational schema) A relational schema S = {£,TZ,A) consists of a 
set of entity classes £ = {Ei, . . . , Em}; relationship classes TZ = {Ri, . . . , Rn}, where each 
Ri = {El, . . . , Ej} with Ej G £; attribute classes A{I) for each item / G £ U TZ; and 
cardinality function card{R, E) = {one, many} for each R & TZ and each E & R. 

A relational schema can be represented graphically with an entity-relationship (ER) 



diagram. We adopt a slightly modified ER diagram using Barker's notation (1990), under 



which entities are rectangular boxes, relationships are diamonds with dashed lines connect- 
ing its associated entities, attributes are ovals residing on entities and relationships, and 
cardinalities are represented with crow's foot notation. 

Example 4.1 The relational schema S for the organization domain example depicted in 
Figure l(a)| consists of entities £ = {Employee, Product, Business-Unit}; relation- 
ships TZ = {Develops, Funds}, where Develops = {Employee, Product}, Funds 
= {Business-Unit, Product} and having cardinalities card(DEVELOPS, Employee) = 
MANY, car(i(DEVELOPS, Product) = many, card(FuNDS, Business-Unit) = one, and 
car(i(FuNDS, Product) = many; and attributes ^(Employee) = {Competence, Salary}, 
^^(Product) = {Success}, and ^(Business-Unit) = {Budget, Revenue}. □ 

Propositional representations describe domains with a single entity class; thus, they pro- 
duce simple schemas with = 1 (one entity class) and |7^| = (no relationship classes). For 
the organization domain example, consider data about only employees {£ = {Employee}). 
Variables would include intrinsic attributes, such as competence and salary, but could also 
include variables describing other related entities, all from the employee perspective (see 
Figure 4(a) for the ER diagram of such a simple schema)]^ That is, we could construct 
a single table for employees that includes columns for the proportion of successfully de- 
veloped products, the total revenue of all business units they work under, etc. Note that 

3. This technique of translating a relational database down to a single, propositional representation is often 



referred to as proposittonalization (Kramer et al. 20011 



9 



Maier, Marazopoulou, and Jensen 




^ ^ompetence^ RVk^^ ^ 

EMPLOYEE 



(a) 




Roger 



(b) 

Figure 4: Relational schema (a) and skeleton (b) for a propositional representation of em- 
ployees from the organization domain example. The skeleton consists of indi- 
vidual, disconnected entity instances. Each RVj is a relational variable from the 
Employee perspective, such as the proportion of products developed successfully 
or the total revenue of involved business units. 



propositionalizing inherently relational data still requires the IID assumption, which is of- 
ten violated since variables of one instance can influence variables of another. For example. 



according to the model in Figure 2(a) the competence of collaborating employees influences 
the success of products, which affects the revenue of business units, which affects its bud- 
get, thereby influencing an employee's salary. As a result, modeling relational data with a 
propositional representation may unnecessarily lose valuable information, especially in the 
context of causal reasoning and accurate estimation of causal effects. 

A relational schema is a template for the underlying relational skeleton (sometimes re- 
ferred to as the data graph), a specific instantiation of entities and relationships. Specifically 
(adapted from Heckerman et al.\ 2007): 



Definition 4.2 (Relational skeleton) A relational skeleton a is an instantiation of entity 
sets (t{E) for each E & £ and relationship sets cr{R) for each R & TZ, adhering to its 
cardinality constraints. Let r G cr{R) where R = {Ei, . . . , Ej} be denoted as r(ei, . . . , ej) 
where G cr(£'j) and Ei £ £. 

Example 4.2 The relational skeleton a for the organization example is depicted in Fig- 
ure 1(b), In this small skeleton, there are entity sets (t(Employee) = {Paul, Quinn, 



10 



Independence in Models of Relational Data 



Roger, Sally, Thomas}, (t(Product) = {Case, Adapter, Laptop, Tablet, Smartphone} , and 
(t(Business-Unit) = {Accessories, Devices}. There are also relationship sets (t(Develops) 
= {develops(Paul, Case), develops(Quinn, Case), . . . , develops (Thomas, Smartphone)} and 
cr(FuNDS) = {fmids(Accessories, Case), funds(Accessories, Adapter), funds(Devices, 
Smartphone)}. The relationship sets adhere to their cardinality constraints (e.g.. Funds is 
a ONE-to-MANY relationship — within cr(FUNDS), every product has a single business unit, 
and every business unit may have multiple products). □ 



The relational skeleton of a propositional representation consists of a set of discon- 
nected entity instances, all drawn from the same entity class. Consequently, the skeleton 
has a simple one-to-one mapping with the representation as a table: Each entity instance 
corresponds to a single row, and each variable is a column. Figure |4(b)| shows a skeleton 
for the organization domain example had it been propositionalized from the employee per- 
spective. Each employee is an entity instance with variables drawn from intrinsic attributes 
and possibly variables describing other related entities. No instances of other entity types 
or relationships exist in the skeleton of a propositional representation. 

In order to specify a model over a relational domain, we must define a space of possi- 
ble variables and dependencies. Consider the example dependency [Product, Develops, 



Employee]. Competence — )• [Product]. Success from the model in Figure 2(a) The van 



ables here include an intrinsic entity attribute (success of a product) and also a variable on 
another entity (the competence of employees that develop a product). For relational data, 
the variable space includes not only intrinsic entity and relationship attributes, but also the 
variables on other entities and relationships that are reachable by paths along the relational 
skeleton. As above, paths in a relational skeleton are instantiations of path templates on a 
relational schema. 



Definition 4.3 (Relational path) A relational path [/i, . . . ,Ik] for relational schema S 
is an alternating sequence of entity and relationship classes Ii, . . . ,Ik € £ L)TZ such that 



for ah j > 1: (1) if Ij G £, 



then 



G TZ and Ij participates in {Ij G 



(2) if 



Ij G TZ, then Ij^i G £ and Ij-i participates in Ij {Ij-i G Ij), and (3) for each ordered 



triple Ij, in [/i, . . . ,Ik], if Ij G TZ, then ^ /j+i and if Ij 

Ij-i 7^ Ij+i or 3/e G Ij~i such that Ij ^ le and card{Ij-i, le) = many. 
base item, or perspective, of the relational path. 



G £, then either 
Ii is called the 



Definition 4.3 generalizes the notion of "slot chains" from the PRM framework (Getoor 



et al. , 2007) by including cardinality constraints and formally describing the semantics 



under which repeated item classes may appear on a path. Conditions (2) and (3) in the 
definition essentially remove any paths that would invariably reach an empty terminal set 
(see Definition 4.4 below) Also, since relational paths may become arbitrarily long, the 



path length is ordinarily limited by a user-specified, domain-specific hop threshold. 



4. The condition Ij £ TZ ^ 7^ Ij+i suggests at first glance that self-relationships (e.g., employees 

manage other employees, individuals in social networks maintain friendships, scholarly articles cite other 
articles) are prohibited. However, a common procedure in ER modeling is to map entity names to 
role names within the self-relationship (e.g., manager/subordinate, friend l/friend2, citing-paper/cited- 
paper), which accords with our definition of relational paths. For simplicity, we omit this additional 
layer of complexity. 



11 



Maier, Marazopoulou, and Jensen 



Example 4.3 From the example relational schema in Figure 1(a), we can derive the set 
of all relational paths up to some hop threshold. Let the hop threshold be /i = 4. Then, 
the set of relational paths from the Employee perspective would include the following: 
[Employee] (employees, hops), [Employee, Develops, Product] (products developed 
by employees, 2 hops), [Employee, Develops, Product, Funds, Business-Unit] (busi- 
ness units of the products developed by employees, 4 hops), and [Employee, Develops, 
Product, Develops, Employee] (other employees developing the same products, 4 hops). 
Invahd relational paths would include [Employee, Develops, Employee] (because Em- 
ployee=Employee and Developsg TZ) and [Business-Unit, Funds, Product, Funds, 
Business-Unit] (because Products £ and card(FuNDS, Business-Unit) = one). □ 

An instantiated relational path produces a set of traversals on a relational skeleton. 
However, the quantity of interest is not the paths themselves, but the set of reachable items 
(i.e., entity or relationship instances): 

Definition 4.4 (Terminal set of a relational path) For any skeleton a and any ii G 
(t(/i), a terminal set P\i^ for relational path P = [/i, . . . , 7^] can be defined inductively as 

[h]W={ii} 

[/i, . . . ,/fc_i,4]|ji = [J {ik I {{ik-i G ik if 4 G 7^) V {ik G ik-i if h G <?)) 

i.-ie[/i,...,4_i]lH A ^ [/i, . . for j = 1 to A: - 1} 

A terminal set of a relational path consists of reachable instances of class I^, the terminal 
item on the path. Conceptually, a terminal set is produced by traversing the skeleton 
beginning at a single base item ii E ^■(A), following instances of the items in the relational 
path, and reaching a target set of Ik instances. The definition implies a "bridge burning" 
semantics under which no instantiated items are revisitedl£] 

Example 4.4 We can generate terminal sets of a relational path by pairing the set of 



relational paths produced by the schema in Figure 1(a) with the relational skeleton in 



Figure 1(b) Let Quinn be our base item instance. Then [Employee] Iquinn = {Quinn}, 
[Employee, Develops, Product] Iquinn = {Case, Adapter, Laptop}, [Employee, Devel- 
ops, Product, Funds, Business-Unit] Iquinn = {Accessories, Devices}, and [Employee, 
Develops, Product, Develops, Employee] Iquinn = {Paul, Roger, Sally}. The bridge 
burning semantics enforces that Quinn is not also included in this last terminal set. □ 

Most relational paths start and end with different item classes. However, there are pairs 
of distinct srelational paths that start and end with the same item classes. For these pairs, 
it is possible that their terminal sets, when originating at the same base item instance, 
will have items in common. The following lemma states that if two relational paths with 
the same base and target items diverge in the middle of the path, then for some relational 



5. The bridge burning semantics yields terminal sets that are necessarily subsets of terminal sets which 
would otherwise be produced without bridge burning. Although this appears to be limiting, it actually 
enables a strictly more expressive class of relational models. See Appendix |B] for more details and an 
example. 



12 



Independence in Models of Relational Data 



skeleton, their terminal sets will have a non-empty intersection. This property is important 
to consider for relational d-separation, and this is the only form for which non-empty 
intersection can occur. 

Lemma 4.1 For any schema S and any two relational paths Pi = [/i, . . . , /m, . . . , I^] and 
P2 = [h, . . . , In, ■ ■ ■ , Ik] with Im ^ In, there exists a skeleton a such that n P2|n / 
for some ii G (^(/i). 

Proof. See Appendix \K\ 

Example 4.5 Let Pathi = [Employee, Develops, Product, Develops, Employee, 
Develops, Product], the terminal sets for which yields other products developed by col- 
laborating employees. Let Path2 = [Employee, Develops, Product, Funds, Business- 
Unit, Funds, Product], for which terminal sets consist of other products funded by 
the business units funding products developed by a given employee. Intersection among 
terminal sets for these paths occurs even in this small skeleton. In fact, the intersection 
of the terminal sets for Pathi and Path2 is non-empty for all employees. For example, 
Paul: Pathi|paui = {Adapter, Laptop} and Path2|paui = {Adapter}; Quinn: Pathilquinn = 
{Tablet} and Path2| Quinn = {Tablet, Smartphone}. □ 

Given the definition for relational paths, it is simple to define relational variables and 
their terminal sets. 

Definition 4.5 (Relational variable) A relational variable [Ii, . . . ,Ik\.V for schema S 
consists of a relational path [/i, . . . , 1^] and an attribute class V G A{Ik)- 



Example 4.6 Relational variables for the relational paths in Example |4. 3| include intrinsic 
variables such as [Employee]. Competence and [Employee]. S'a/ary, and also variables on 
related entities such as [Employee, Develops, Product]. S'ltccess, [Employee, Devel- 
ops, Product, Funds, Business-Unit]. i?e?;enMe, and [Employee, Develops, Prod- 
uct, Develops, Employee]. S'a/ary. □ 

Definition 4.6 (Terminal set of a relational variable) For any skeleton a and ii G 
(t(/i), a terminal set P-V\ii for relational variable P.V = [Ii, . . . ,Ik]-V is the set of variable 
instances {ik-V \ V £ A{ik) A G P\ii}- 

Example 4.7 Terminal sets for base item instance Sally and the relational variables from 



Example 4.6 include [Employee]. Competence [saiiy = {Sally. Competence}, [Employee, 
Develops, Product]. Success [saiiy = {Laptop. Success, Tablet. Success}, [Employee, 
Develops, Product, Funds, Business-Unit]. i?ewentie|saiiy = {Devices. i^eueniie}, and 
[Employee, Develops, Product, Develops, Employee]. S'a/arylsaiiy = {Quinn. S'a/ary, 
Thomas. S'a/ary}. □ 

As a notational convenience, if X is a set of relational variables, all from a common 
perspective Ii, then we say that X|jj for some item ii G o"(/i) is the union of all terminal 
sets, {x \ X G A X G X}. Given the formal definitions for relational variables, we can 
now define relational dependencies. 



13 



Maier, Marazopoulou, and Jensen 



Definition 4.7 (Relational dependency) A relational dependency [Ii, . . . ,Ik]-Vi — t- 
[Ii].V2 consists of two relational variables with a common base item and corresponds to a 
directed probabilistic dependence from [Ii, . . . , Ik]-Vi to [/i].V2. 

Depending on the context, [/i, . . . , Ik]-Vi and [/i].V2 can be referred to as treatment and 



outcome, cause and effect, or parent and child. Without loss of generality, Definition 4/7_ pro- 
vides a canonical specification for dependencies, with the child relational variable restricted 
to singleton paths, thus ensuring that the terminal sets of a child relational variable consist 
of a single value. 



Example 4.8 The dependencies in the relational model displayed in Figure |2(a) can 
be specified as: [Product, Develops, Employee]. Competence — )• [Probvct]. Success 
(product success is influenced by the competence of the employees developing the prod- 
uct), [Employee]. Competence — [Employee]. Salary (an employee's competence affects 
her salary), [Business-Unit, Funds, Product]. Success [Business-Unit]. i?efenwe 
(the success of the products funded by a business unit influences that unit's revenue), 
[Employee, Develops, Product, Funds, Business-Unit]. i?u(i5et—)>[EMPLOYEE].5a/ari/ 
(employee salary is governed by the budget of the business units for which they develop 
products), and [BusiNESS-UNlT].i2euenue — ^ [BusiNESS-UNiT].i?ndg'ei (the revenue of a 
business unit influences its budget). □ 

A relational model is simply a collection of relational dependencies, defined as: 

Definition 4.8 (Relational model) The structure of a relational model A4 = {S,T>) 
consists of a schema S paired with a set of relational dependencies P defined over S. 

A relational model can be represented graphically by superimposing dependencies on 
the ER diagram of a relational schema (see Figure 2(a) for an example). This definition 



of relational models is consistent with and yields structures expressible as DAPER models 



(Heckerman et al. , 2007). These relational models are also equivalent to PRMs, but we gen- 



eralize slot chains as relational paths and provide a formal semantics for their instantiation 
(i.e., terminal sets of relational paths and relational variables). These models are also more 
general than plate models because dependencies can be specified with arbitrary relational 



paths as opposed to simple intersections among plates (Buntine, 1994 Gilks et al. , 1994). 



Relational models only define coherent joint probability distributions if they produce 
acyclic model instantiations (i.e., ground graphs, defined below). A useful construct for 



checking model acyclicity is the class dependency graph (Getoor et al. , 2007), defined as: 



Definition 4.9 (Class dependency graph) The class dependency graph Cm = (^i E) 
for relational model Ai = {S,T)) is a directed graph with nodes for each attribute of each 
item class V = {I.V \ I G £UlZ AUG -^[I)} and edges between pairs of attributes supported 
by relational dependencies in the model E = {Ik-Vi — >• Ij-y2 \ [Ij-, ■ ■ ■ ■, -^fe]-Ui — )• [/j].U2 G V}. 

If the relational dependencies form an acyclic class dependency graph, then every possi- 
ble ground graph of that model is acyclic as well (Getoor et al. , 2007pn All future references 



Getoor et al. 



(20071 additionally define guaranteed acyclic relationships which enable cyclic relational 



dependencies that are guaranteed to produce only acyclic instantiations. The class dependency graphs 
are accordingly "colored," and acyclicity is checked with respect to stratification across colors. We omit 
this type of relationship for simplicity of developing the relational rf-separation theory. 



14 



Independence in Models of Relational Data 




^Competence 



(a) 



; Paul.RV2 ]2> 



Paul.Competence 



Sally.Salary ~~2 >^^ -C ^Sally.RVa 2 I> 

Piul.RVr: :^ I ^C^^allyRvT ^ | 

Sally.Competence (— »< ^Sally.RVk7I > 

"Quinn.RV23> 



g'~Quinn.Competence~~a^ L-»< CrQuinn.RVk^ 



dThomas.Salary]3 
i L 



| [Thomas.RV2^ 



cCThomas.Competencr>^ '~-»<lT^°"''^s.R\7r> 



gr~RoagrComDetence^a^ ^ — ►C lRPger-RvC ^ 



(b) 



Figure 5: Example Bayesian network (a) and ground graph (b) for the propositional rep- 
resentation of employees from the organization domain. The model consists of 
simple dependencies involving employee variables. The ground graph is a set of 
identical copies of the model structure, one for each instance in the data (skele- 
ton). 



to acyclic relational models refer to relational models whose dependency structure forms 
acyclic class dependency graphs. A parameterized relational model contains conditional 
probability distributions for every attribute class A{1^ for each / G £^ U 7^ in order to rep- 
resent a joint probability distribution. Similar to Bayesian networks, the joint distribution 
factorizes according to the conditional distributions given a relational skeleton o as 

P{GGMa) = n n n ^(^-^ l ParentsiLX)) 
lesunxeAii) iea{i) 

where GGmct is the ground graph of the model and skeleton, defined below. Note that 
without a generative model of relational skeletons, these relational models are not truly 
generative as the skeleton must be generated prior to the attributes. However, the same issue 
occurs for Bayesian networks — relational skeletons consist of disconnected entity instances, 
but the number of such instances to create is not described by the model. We choose simple 
processes for generating skeletons, allowing us to focus on relational models of attributes 
and leaving structural causes and effects as future work. 



15 



Maier, Marazopoulou, and Jensen 



A common propositional model is the Bayesian network, as described in Section |3j 
Because all variables are defined for a single entity type and no relationships exist, the 
relational path specification becomes trivial and, hence, implicit. All relational paths, 
relational variables, and relational dependencies are defined from a single perspective with 



singleton paths (e.g., [Employee]). Figure 5(a) depicts an example Bayesian network for 



the schema in Figure 4(a) The entity box surrounding the model is implicit given a single 
entity type. 

Just as the relational schema is a template for skeletons, a relational model can be 
viewed as a template for ground graphs — how dependencies apply to skeletons. 

Definition 4.10 (Ground graph) The ground graph GGmo = (^j E) for relational model 
M = {S, T>) and skeleton cj is a directed graph with nodes V = A{cr) = {i.X \ I G 
Sun A X e A{I) A i G (t(/)} and edges E = {ik-Y ij.X \ ik.Y,ij.X eV A ik-Y 
€[Ij,...,Ik].Y\i^ A [Ij,...,h].Y ^[Ij].X eV}. 

A ground graph is a large directed graph, with a node for every variable of every entity 
and relationship instance in a skeleton, and an edge between pairs of variable instances 
that belong to the terminal sets of the relational variables of all dependencies in a model. 
In fact, given an acyclic relational model, the ground graph has the same semantics as a 



Bayesian network (Getoor, 2001; Heckerman et al. 2007). The ground graph in Figure 2(b) 



is the result of applying the dependencies in the relational model shown in Figure 2(a) 



the skeleton in Figure 1(b) 



to 



The ground graph of a Bayesian network, like the skeleton of a propositional schema, 
has a very regular structure. The ground graph consists of a set of independent identical 
instances of the model structure, one for each instance in the data. Figure |5(b)| shows the 
resulting ground graph of applying the model to the data, or skeleton, of Figure |4(b)[ 

By Lemma 4.1 and Definition 4.10 it is clear that the same canonical dependency 
connects many other relational variables. If the terminal sets involve ik-Y and ij.X, then 
there is an implied dependency between all other relational variables for which i^.Y and 



ij.X are elements. 



Example 4.9 The canonical dependency [Employee]. Compefence—7>[EMPLOYEE]. Salary 
yields the edge Koger. Competence — )• Kogei. Salary in the ground graph of Figure 2(b)] be- 
cause YLogQT .Competence (z [Employee]. Co77ipete7ic6| Roger- 

However, Kogei. Competence 

£ [Employee, Develops, Product, Develops, Employee]. Competence [saiiy (as is 
Kogev. Salary, replacing Competence with Salary). Consequently, the canonical depen- 
dency implies dependence among the variable instances in the terminal sets for [Employee, 
Develops, Product, Develops, Employee]. Competence and [Employee, Develops, 
Product, Develops, Employee]. 5a/ary. □ 

These implied dependencies form the crux of the challenge of identifying independence in 
relational models. Additionally, the intersection between the terminal sets of two relational 
paths is crucial for reasoning about independence because variable instances can belong 
to the terminal sets of multiple relational variables. Since d-separation only guarantees 
independence when there are no (i-connecting paths, we must consider all possible paths 
among variable instances, any one of which may be a member of multiple relational variables. 



16 



Independence in Models of Relational Data 



In Section [5} we define relational c^-separation and provide an appropriate representation, 
the abstract ground graph, that enables straightforward reasoning about (/-separation. 

5. Relational d-separation 

Conditional independence facts are entailed by the rules of d-separation, but only for simple 
directed acyclic graphs. As described in Section |4j every ground graph of a Bayesian 
network consists of a set of identical copies of the model structure. Thus, the implications 
of d-separation on Bayesian networks hold for every ground graph. In contrast, relational 
models are templates for ground graphs that vary with underlying skeletons. That is, the set 
of distributions represented by a relational model is not only parameterized by conditional 
probabilities, but also by the space of valid skeletons given by the schema. Since conditional 
independence facts are only useful when they hold across all possible model instantiations, 
reasoning about (i-separation for relational models is inherently more challenging and leads 
to the following definition: 

Definition 5.1 (Relational rf-separation) Let X, Y, and Z be three distinct sets of 
relational variables for perspective B £ £ U TZ defined over relational schema S. Then, for 
relational model M, X and Y are d-separated by Z if and only if, for any skeleton a, X|fe 
and Y|fe are (/-separated by Z\b in ground graph GGmu for all h G cr{B). 

In other words, for X and Y to be (/-separated by Z for relational model A^, (i-separation 
must hold for all instantiations of those relational variables for any possible skeleton. This is 
a conservative definition, but it is consistent with the semantics of (i-separation on Bayesian 
networks — it guarantees independence, but it does not guarantee dependence. If there exists 
even one valid skeleton and faithful distribution represented by the relational model for 
which X 4^ Y I Z, then X and Y are not (/-separated by Z. 

Given the semantics specified in Definition |5.1[ answering relational (/-separation queries 
is challenging for the following reasons: 

All- ground- graphs semantics: Although it is possible to verify (i-separation on a single 
ground graph of a relational model, the conclusion may not generalize and ground graphs 
can be arbitrarily large. The semantics of (i-separation on Bayesian networks also holds for 
all possible ground graphs. However, the ground graphs of a Bayesian network consist of 
identical copies of the structure of the model, and it is sufficient to reason about (i-separation 
on a single subgraph. 

Relational models are templates: Relational models may appear to be directed acyclic 
graphs, but they are templates for constructing ground graphs. The rules of (i-separation 
do not directly apply to relational models, only to its ground graphs. Applying the rules 
of (i-separation to a relational model frequently leads to incorrect conclusions because of 
unrepresented confounding paths that are only manifest in ground graphs. 

Terminal sets of relational variables may intersect: The terminal sets of two different 



relational variables may have non-empty intersections, as described by Lemma 4.1 Con- 
sequently, there exist non-intuitive implications of dependencies that (/-separation must 
account for, such as the relational (/-connecting paths in the example in Section [2j 

Relational dependency specification: Relational models are defined with canonical de- 
pendencies, each specified from a single perspective. However, variables in a ground graph 



17 



Maier, Marazopoulou, and Jensen 



may contribute to the terminal sets of multiple relational variables, each defined from dif- 
ferent perspectives. Thus, we need methods to translate and extend canonically specified 
dependencies to produce the implied dependencies between arbitrary relational variables, 
such as the implied dependency described in Example |4.9[ 

5.1 Abstracting over All Ground Graphs 

The definition of relational d-separation and its challenges suggest a solution that abstracts 
over all possible ground graphs and explicitly represents the potential intersection between 
the terminal sets of pairs of relational variables. We introduce a new lifted representation, 
called the abstract ground graph, that captures all dependencies among arbitrary relational 
variables for any ground graph, using knowledge of only the schema and the model. 

Definition 5.2 (Abstract ground graph) An abstract ground graph AGGMBh = (^^E) 
for relational model Ai = perspective B £ S U TZ, and hop threshold /i G N*^ is an 

abstraction of the dependencies P for all possible ground graphs GGm^ of on arbitrary 
skeletons cr. 

The set of nodes in AGGMBh-, V = RVUIV , is the union of all relational variables RV = 
[\^B, . . . , Ij\.V I length{[B , . . . , Ij]) < /i+l} and the intersections between pairs of relational 
variables that could intersect IV = {XnY \ X,Y £ RV A X = [B, . . . , 1^, . . . , Ij].V A Y = 
[B,...,Ii,...,Ij].V A h^h). 

The set of edges in AGGmbh is E = RVE U IVE, where RVE C RV x RV and 
IV E C IV X RV U RV X IV. RVE is the set of edges between pairs of relational variables: 
RVE = {[B,..., Ik].Vi ^ [B,..., Ij].V2 \{Ij,..., Ik]-Vi ^ [Ij]-V2 A [i?, . . . , 4] e 

extend{[B, . . .,/,], [Ij, . . . ,4])}. 

IVE is the set of edges inherited from both relational variable sources of every inter- 
section variable: IVE = {X ^ [B, . . . ,Ij].V2 \ X = Pi.Vi D P2.V1 e IV A {Pi.Vl 
[B,..., Ij].V2 e RVE V P2.V1 ^ . . . , Ij].V2 e RVE)} U {[B,...,Ij].Vi ^ X \ X = 
P1.V2nP2.V2 €IV A {[B, . . .,Ij].Vi Pi.Vl G RVE V [B,.. .,Ij].Vi P2.V1 £ RVE)}. 



The extend method is described in Definition 5.3 below. Essentially, the construction of 
an abstract ground graph for relational model Ai, perspective B £ SUTZ, and hop threshold 
h follows three simple steps: (1) Add a node for all relational variables, with relational path 
length limited by h. (2) Insert edges for the direct causes of every relational variable. (3) 
For each pair of potentially intersecting relational variables, add a new "intersection" node 
that inherits the direct causes and effects from both of its sources. Then, to answer queries 
of the form "Are X and Y d-separated by Z?", simply (1) augment X, Y, and Z with their 
corresponding intersection variables and (2) apply the rules of d-separation on the abstract 
ground graph for the common perspective of X, Y, and Z. 

Example 5.1 Figure [6] shows the abstract ground graph AGGm Employee 6 for the organi- 
zation example from the Employee perspective with hop threshold /i = 6j^ As in Section [2| 
we derive an appropriate conditioning set Z in order to d-separate individual employee com- 
petence (X = {[Employee]. Competence}) from the revenue of the employee's funding busi- 
ness units (Y = {[Employee, Develops, Product, Funds, Business-Unit]. fieuenue}). 

7. The variables Salary and Budget are removed for simplicity. They are irrelevant for this rf-separation 
example as they are solely effects of other variables. 



18 



Independence in Models of Relational Data 



[EMPLOYEE, DEVELOPS, PRODUCT, FUNDS, BUSINESS-UNIT, FUNDS, PRODUCT] .Success 



(TemPLOYEE] .Competence^ ►(^[IemPLOYEE, DEVELOPS, PRODUCT].Succesr]) ►<^||^]jiMPLOYEE, DEVELOPS, PRODUCT, FUNDS, BUSINESS-UNItl-ffei^eni/e^ 



7 



^EMPLOYEE, DEVELOPS, PRODUCT, DEVELOPS, EMPLOYEE]. Compe(eiTce 



z 



EMPLOYEE, DEVELOPS, PRODUCT, DEVELOPS, EMPLOYEE, DEVELOPS, PRO DUCT]. Success 

n 

EMPLOYEE, DEVELOPS, PRODUCT, FUNDS, BUSINESS-UNIT, FUNDS, PRODUCTJ.Success 



[EMPLOYEE, DEVELOPS, PRODUCT, DEVELOPS, EMPLOYEE, DEVELOPS, PRODUCT].Soccess 



Figure 6: The abstract ground graph for the organization domain model in Figure 2(a) 
from the Employee perspective with hop threshold h = 6 (with the variables for 
Salary and Budget omitted for simplicity). This abstract ground graph includes 
one intersection node. 



Applying the rules of (f-separation to the abstract ground graph, we see that it is necessary 
to condition on both product success ([Employee, Develops, Product]. Success) and 
the competence of other employees developing the same products ([Employee, Develops, 
Product, Develops, Employee]. Competence). For h = 6, augmenting X, Y, and Z with 
their corresponding intersection variables does not result in any changes. For h = 8, the ab- 
stract ground graph includes a node for relational variable [Employee, Develops, Prod- 
uct, Develops, Employee, Develops, Product, Funds, Business-Unit]. .Revenue 
(the revenue of the business units funding the other products of collaborating employees) 



which, by Lemma 4.1, could have a non-empty intersection with [Employee, Develops, 
Product, Funds, Business-Unit]. i?euenne. Therefore, Y would also include the inter- 
section with this other relational variable. However, for this query, the conditioning set Z 
for h = 6 happens to also d-separate at /i = 8 (and any h £ N^). □ 



Using the algorithm devised by Geiger et al. (1990), relational c?-separation queries 



can be answered in 0(|ii^|) time with respect to the number of edges in the abstract 
ground graph. In practice, the size of an abstract ground graph depends on the relational 
schema and model (i.e., the number of entities, relationships, cardinalities, attributes, and 



dependencies — see the experiment in Section 7.1), as well as the hop threshold limiting the 



length of relational paths. For the example in Figure [6| the abstract ground graph has 7 
nodes and 7 edges (including 1 intersection node with 2 edges); for /i = 8, it would have 13 
nodes and 21 edges (including 4 intersection nodes with 13 edges). Abstract ground graphs 
are invariant to the size of ground graphs, even though ground graphs can be arbitrarily 
large — that is, relational databases have no maximum size. 

Next, we formally define the extend method, used internally for the construction of 
abstract ground graphs. This method translates dependencies specified in the model into 
dependencies in the abstract ground graph. 

Definition 5.3 (Extending relational paths) Let Porig = [hT--,Ij] and let Pext = 
[Ij, . . . , Ik] be two relational paths for schema S. The following three functions extend Porig 



19 



Maier, Marazopoulou, and Jensen 



with Pext- 



extend{Porig, Pext) = {P = concat{Porig[0 : length{Porig) - i + l],Pext[i ■ length{Pext)]) \ 

i £ pivots{reverse{Porig), Pext) A isValid{P)^ 

pivots{Pi,P2) = {i \ Pi[0 : i] = P2[0 : i]} 

. f True if P does not violate Definition l43l 

isV aliaiP) = < ^ , , , 

[ liaise otherwise 

where concat, length, reverse, and inclusive-exclusive sublist are standard list functions. 

Informally, the extend method constructs a set of valid relational paths from two input 
relational paths. It first finds the indices of all items for which the input paths have a 
common starting subpath (called pivots), and then it concatenates the two paths at each 



pivot, removing one of the duplicated subpaths (see Example 5.2). Since (Z-separation re- 
quires blocking all paths of dependence between two sets of variables, the extend method 
is critical to ensure the soundness and completeness of our approach. The abstract ground 
graph must capture all paths of dependence among the variable instances in the terminal 
sets of relational variables for any represented ground graph. However, relational models 
are specified solely with respect to canonical dependencies; the extend method is called re- 
peatedly during the creation of an abstract ground graph, with Porg set to a given relational 
path and Pext drawn from a canonical dependency. 

Example 5.2 During the construction of the abstract ground graph j4GG_a/( employee 6 

de- 
picted in Figure [6j the extend method is called several times. First, all relational variables 
for h = 6 from the Employee perspective are added as nodes in the graph. Next, extend is 
used to insert edges corresponding to direct causes. Consider the node for [Employee, De- 
velops, Product]. Success. The construction of j4GG^ employee 6 calls extend{Porig , Pext) 
with Porig = [Employee, Develops, Product] and Pext = [Product, Develops, Em- 
ployee] because [Product, Develops, EMPLOYEE].Competence—^ [Product]. Success G 
V. Here, extend{Porig , Pext) = {[Employee], [Employee, Develops, Product, Devel- 
ops, Employee]}, which leads to the insertion of two edges in the abstract ground graph. 
Note that pivots{r ever se{Porig), Pext) = {1,2,3}, and the pivot at i = 2 yields the invalid 
relational path [Employee, Develops, Employee]. □ 

We also describe two important properties of the extend method with the following two 
lemmas. The first lemma states that every relational path produced by the extend method 
yields a terminal set for some skeleton such that there is an item instance also reachable by 
the two original paths. This lemma is useful for proving the soundness of our abstraction — 
that all edges inserted in an abstract ground graph correspond to edges in some ground 
graph. 

Lemma 5.1 For any two relational paths Porig = [Ii, ■ ■ ■ , Ij] o-nd P^xt = [Ij, ■ ■ ■ ,Ik] with 
P = extend{Porig, Pext) , VP e P there exists a relational skeleton a such that 3ii G (t(/i) 
such that 3ik G P\i-^ and 3ij G Poriglh such that i^ G Pext\ij- 



20 



Independence in Models of Relational Data 



Proof. See Appendix [K] 



Example 5.3 Let a be the skeleton shown in Figure 1(b), let Porig = [Employee, 



Develops, Product], let Pext = [Product, Develops, Employee], and let ii 



Sally G (t(Employee). From Example 5.2, we know that P = extend{Parig, Pext) 



{[Employee], [Employee, Develops, Product, Develops, Employee]}. We also 
have [Employee] I Sally = {Sally} and [Employee, Develops, Product, Develops, Em- 
ployee] [saiiy = {Quinn, Roger, Thomas}. By Lemma 5.1 , there should exist an ij £ Poriglh 



for each of these employees (Quinn, Roger, Sally, and Thomas) such that they are in the ter- 
minal set Pextliy We have Poriglsaiiy = {Laptop, Tablet}, and PextlLaptop = {Quinn, Roger, 
Sally} and -Pext I Tablet = {Sally, Thomas}. So, the lemma clearly holds for this example. □ 



Lemma 5.1 guarantees that, for some relational skeleton, items in the terminal sets 
produced by extend are also reachable in combination of the two input paths, Porig and 
Pext- It is also possible (although infrequent) that there exist items reachable by Porig 
and Pext that are not in the terminal set of any path produced with extend{Porig , Pext) ■ 
The following lemma describes this unreachable set of items, stating that there must exist 
an alternative relational path P^^^g that intersects with Porig that, when using extend, 
catches those remaining items. This lemma is important for proving the completeness of 
our abstraction — that all edges in any ground graph are represented in the abstract ground 
graph. 

Lemma 5.2 For any relational skeleton a and two relational paths Porig = [A, • ■ • , Ij] o,nd 
Pext = [/,-,..., 4] with P = extend{Porig,Pext), Vii G cr(/i) yij £ Porig\i^ yik G Pextlij if 
yP £ P ik ^ -P|ii, then ^P'orig such that PorigW ^ P'origW / ^ '^''^d ik G -P'lii for some 
P' G extend{P'^-g,Pext)- 

Proof. See Appendix \K\ 

Example 5.4 Although Lemma |5.2| does not apply to the organization domain as cur- 
rently represented, it could apply if either (1) there were cycles in the relational schema 
or (2) if the path specifications on the relational dependencies included a cycle. Consider 
additional relationships between employees and products. If employees could be involved 
with products at various stages (e.g., research, development, testing, marketing), then there 
would be alternative relational paths for which the lemma might apply. The proof of the 
lemma in Appendix [A| provides abstract conditions describing when the lemma applies. □ 



5.2 Proof of Correctness 

The correctness of our approach to relational d-separation relies on several facts: (1) d- 
separation is valid for directed acyclic graphs (DAGs); (2) ground graphs are DAGs; and 
(3) abstract ground graphs are directed acyclic graphs that represent exactly the edges that 
could appear in all possible ground graphs. It follows that d-separation on abstract ground 
graphs, augmented by intersection variables, is sound and complete for all ground graphs. 
Additionally, we show that since relational d-separation is sound and complete, it is also 
equivalent to the Markov condition for relational models. Using the previous definitions 
and lemmas, the following sequence of results proves the correctness of our approach to 
identifying independence in relational models. 



21 



Maier, Marazopoulou, and Jensen 



Theorem 5.1 The rules of d- separation are sound and complete for directed acyclic graphs. 



Proof. Due to Verma and Pearl (1988) for soundness and Geiger and Pearl (1988) for 
completeness. ■ 



Theorem 5.1 implies that (1) all conditional independence facts derived by d-separation 



on a Bayesian network hold in any faithful distribution represented by that model (sound- 
ness) and (2) all conditional independence facts that hold in a faithful distribution can 
be inferred from d-separation applied to the Bayesian network encoding that distribution 
(completeness). 

Lemma 5.3 For any acyclic relational model Ai and skeleton a, the ground graph GGmu 
is a directed acyclic graph. 



Proof. Due to both [Heckerman et al.| ( |2007[ ) for DAPER models and |Getoor| ( |200l[ ) for 
PRMs. ■ 



Lemma |5.3| states that any ground graph of an acyclic relational model is essentially a 
very large Bayesian network. By Theorem 5.1, d-separation is sound and complete when 
applied directly to a ground graph. However, Definition 5.1 explicitly states that rela- 
tional d-separation must hold across all possible ground graphs, which is the reason for 
constructing the abstract ground graph representation. Next, we introduce the notion of 
(B, /i)-reachability, which describes the conditions under which we can expect an edge in a 
ground graph to be represented in an abstract ground graph. 

Definition 5.4 {{B, /i)-reachability) Let GGMa be the ground graph for some relational 
model Ai and skeleton a. Then, ik.Vi — )• ij.V2 G GGMa is {B, h)-reachahle for perspective 
B and hop threshold h if there exist relational variables Pk-Vi = [i?, . . . , and Pj.V2 = 

[i?, . . . , /j].V2 such that length[Pk) < /i + 1, length{Pj) < /i + 1, and there exists a b £ a{B) 
where ik G Pk\b and ij G -Pj lb- 



Example 5.5 Consider the ground graph shown in Figure 2(b) Let perspective B be Em- 
ployee, and let the hop threshold h = 6. Let ik-Vi — )• ij.V2 be the edge Laptop. S'uccess — )• 
Devices. -Refenue in the ground graph. Then, we can show that this edge is {B, /i)-reachable 
because of the following: Set Pk-Vi = [Employee, Develops, Product]. Success, Pj.V2 = 
[Employee, Develops, Product, Funds, Business-Unit]. iZeuenne, and let b = Sally 
G (t(Employee). We have length{Pk) = 3 < 7, length(Pj) = 5 < 7, LaptopG Pfclsaiiy, and 
DevicesG Pjisaiiy □ 



Since Definition 5.4 pertains to edges reachable via a particular perspective B and hop 
threshold h, it relates to the reachability of edges in abstract ground graphs. Specifically, 

-jcr, we can derive a set 



Definition 5.4 implies that (1) for any edge in ground graph GGmc 
of abstract ground graphs for which that edge is {B, /i)-reachable, and (2) for any abstract 
ground graph AGGmbk-, we can derive the set of (B, /i)-reachable edges for a given ground 
graph. Given (5, /i)-reachability, we can now express the soundness and completeness of 
abstract ground graphs, which is depicted graphically in Figure [7j 



22 



Independence in Models of Relational Data 



{B, /i)-soundness 

AGGMBh ^ * GGMa 

{B, /i)-completeness 

Figure 7: A relational model A4 combined with a perspective B and a hop threshold h 
yields the abstract ground graph AGGj^Bh- The model combines with a skeleton 



a to produce the ground graph GGmct- In Theorem 5.2 we show that abstract 



ground graphs are both sound and complete with respect to {B, /i)-reachability. 



Theorem 5.2 For any acyclic relational model M, perspective B € £yjTZ, and hop thresh- 
old h G N*^, the abstract ground graph AGGj^Bh is [B, h)-reachahly sound and complete for 
any ground graph GGjUa for all skeletons a. 

Proof. See Appendix [A) 

Theorem |5.2| guarantees that, up to the hop threshold for a given perspective, abstract 
ground graphs capture all possible paths of dependence between any pair of variables in 
any ground graph. The details of the proof provide the reasons why explicitly representing 
the intersection between pairs of relational variables is necessary for ensuring a sound and 
complete abstraction. 

Theorem 5.3 For any acyclic relational model M, perspective B ^ £\JTZ, and hop thresh- 
old /i G N'^, the abstract ground graph AGGj^iBh is directed and acyclic. 



Proof. Let M be an arbitrary acyclic relational model, let G i5 U 7^ be an arbitrary 
perspective, and let /i G N'^ be an arbitrary hop threshold. It is clear that every edge 



in the abstract ground graph AGGMBh is directed by construction in Definition 5.2 All 
edges inserted in any abstract ground graph are drawn from the directed dependencies in 
a relational model. Since Ai is acyclic, they class dependency graph Gm is also acyclic by 



Definition 4.9, Assume for contradiction that there exists a cycle of length n in AGGMBh 



that contains both relational variables and intersection variables. By Definition 5.2 , all edges 
inserted in AGGmbh are drawn from some dependency in A4, even for nodes corresponding 
to intersection variables. Retaining only the final item class in each relational path for every 



node in the cycle will yield a cycle in G^ by Definition 4.9 So, A4 could not have been 
acyclic, which contradicts the assumption. ■ 



Theorem |5.3| ensures that the standard rules of c?-separation can apply directly to ab- 
stract ground graphs because they are acyclic given an acyclic model. We now have suf- 
ficient supporting theory to prove that d-separation on abstract ground graphs is sound 
and complete. In the following theorem, we define W as the set of nodes augmented 
with their corresponding intersection nodes for the set of relational variables W: W = 
W U UvKewi^ CiW I W nW is an intersection node in AGGMBh}- We also say that 
d-separation holds up to a specified hop threshold h if there are no d-connecting paths 
involving a relational path (for some relational variable) of length greater than /i + 1. 



23 



Maier, Marazopoulou, and Jensen 



Theorem 5.4 Relational d-separation is sound and complete for abstract ground graphs up 
to a specified hop threshold. Let X, Y, and Z be three distinct sets of relational variables 
for perspective B £ £ U TZ defined over relational schema S. Then, for any skeleton a and 
for all b E (t{B), X|ft and Yj^ are d-separated by Z|b up to hop threshold h in ground graph 
GGjiia if o-nd only ifX. and Y are d-separated by Z on the abstract ground graph AGGmbh- 



Proof. We must show that d-separation on an abstract ground graph imphes d-separation 
on all ground graphs it represents (soundness) and that d-separation facts that hold across 
all ground graphs are also entailed by d-separation on the abstract ground graph (com- 
pleteness). Let M be an arbitrary acyclic relational model, and let X, Y, and Z be three 
arbitrary distinct sets of relational variables for perspective B G £\JTZ defined over relational 
schema S. 

Soundness : Assume that X and Y are d-separated by Z on AGG^mbh for some hop 
threshold h G N''. Assume for contradiction that there exists an item b G o"(-B) such that 
X|f, and Ylft are not d-separated by Z|{, in the ground graph GGm<t foi' some arbitrary 
skeleton a. Then, there must exist a d-connecting path p from some x G X|{| to some 
y G 'Y\h given all z G 'L\h- By Theorem 5.2, AGGj^Bh is (S, /i)-reachably complete, so all 
(-B, /i)-reachable edges in GGMa are captured by edges in AGGmbh- So, path p must be 
represented from all nodes in {n | x G to all nodes in {n | y G n\b} in AGGmbh- If 
p is d-connecting in GG^a, then it is d-connecting in AGGj^sh: implying that X and Y 
are not d-separated by Z. So, X|;, and Y|f, must be d-separated by Z\b. 

Completeness: Assume that X|fe and Y|f, are d-separated by Z|fe in the ground graph 
GGMa for all skeletons a for all b G cr{B). Assume for contradiction that X and Y are 
not d-separated by Z on AGG^iBh for some hop threshold h G N". Then, there must 
exist a d-connecting path p for some relational variable AT G X to some y G Y given all 



Z G Z. By Theorem 5.2, AGGjuBh is (S, /i)-reachably sound, so every edge in AGG_MBh 
must correspond to some pair of variable instances in some ground graph. So, if p is d- 
connecting in AGGj^sh^ then there must exist some skeleton a such that p is d-connecting 
in GGmct for some b G cr{B), implying that d-separation does not hold for that ground 
graph. So, X and Y must be d-separated by Z on AGGmbh- * 



Theorem 5.4 proves that d-separation on abstract ground graphs is a sound and complete 



solution to identifying independence in relational models. Given Theorem 5.1, we can also 



say that the set of conditional independence facts derived from d-separation on abstract 
ground graphs is exactly the same (up to a specified hop threshold) as the set of conditional 
independencies in common with all faithful distributions represented by all possible ground 
graphs. 

Additionally, we can show that relational d-separation is equivalent to the Markov 
condition on relational models. In the following definition, let nd{X) be the non-descendant 
variables of X and let pa{X) be the set of parent variables of X . 



Definition 5.5 (Relational Markov condition) Let X = [B, . . . ,Ik\.V be a relational 
variable for perspective B £ £ U TZ defined over relational schema S. Then, for relational 
model Ai, P{X \nd{X),pa{X)) = P{X \ pa{X)) if and only if, for any skeleton a, Vx G 
A|fe P{x I nd{x),pa{x)) = P{x \ pa{x)) in ground graph GGm<t for all b G cr{B). 



24 



Independence in Models of Relational Data 



In other words, a relational variable X is independent of its non-descendants given its 
parents if and only if, for any possible ground graph, the Markov condition holds for all 
instances of X. For Bayesian networks, the Markov condition is equivalent to d-separation 



(Neapolitan, 2004). Because ground graphs are Bayesian networks (by Lemma 5.3) and rela- 



tional (i-separation on abstract ground graphs is sound and complete (by Theorem 5.4), we 
can conclude that relational d-separation is equivalent to the relational Markov condition. 



6. NaiVe Relational d-separation is Frequently Incorrect 

If the rules of ^-separation for Bayesian networks were applied directly to relational models, 
how frequently would the conditional independence conclusions be correct? In this section, 
we evaluate the necessity of our approach to relational d-separation, namely the use of 
abstract ground graphs. We empirically test the consistency of a naive approach to relational 
d-separation against our sound and complete solution over a large space of synthetic causal 
models. In order to do so, we first formally define how d-separation could actually be applied 
directly to the structure of a relational model. Consider the following limited definition 
of relational paths, which itself limits the space of models and conditional independence 
queries. 

Definition 6.1 (Simple relational path) A simple relational path [/i,...,/^] for rela- 
tional schema S is an alternating sequence of entity and relationship classes /i, . . . , G £\JTZ 
such that Ii ^ ■ ■ ■ ^ Ik and for all j > 1 (1) if Ij S £, then Ij^i G TZ with Ij € Ij-i and (2) 
if Ij € IZ, then G £ with S Ij. 



The sole difference between relational paths (Definition 4.3) and simple relational paths 
is that no item class may appear multiple times along a simple relational path. This yields 



paths drawn directly from a schema diagram. For the example in Figure 1(a), [Employee, 



Develops, Product] is simple whereas [Employee, Develops, Product, Develops, 
Employee] is not. 

Additionally, we define simple relational schemas such that for any two item classes 
Ij,Ik £ U TZ, there exists at most one simple relational path between them, that is, no 
cycles occur in the schema diagram. The example in Figure 1(a) is a simple relational 
schema. The restriction to simple relational paths and schemas yields similar definitions 
for simple relational variables, simple relational dependencies, and simple relational models. 



The example in Figure 2(a) is a simple relational model because it consists of only simple 



relational dependencies. 

A first approximation to relational d-separation would be to apply the rules of d- 
separation for Bayesian networks directly to the graphical representation of relational mod- 
els. This is equivalent to constructing the class dependency graph of Definition 4.9 from 



the relational model and applying d-separation. The class dependency graph for the model 
in Figure 2(a) is shown in Figure 8(a) Note that the class dependency graph removes 



perspectives, ignores path designators on dependencies, does not include all implications 
of dependencies among arbitrary relational variables, and does not consider intersection 
variables. 

Although the class dependency graph is independent of perspectives, testing any condi- 
tional independence fact requires choosing a perspective. All variables must have a common 



25 



Maier, Marazopoulou, and Jensen 



([ employee .Compefence )— —►( ^ PRODUCT.Success J )— — ►( juSINESS-UNIT.Kerenue ) 
EMPLOYEE. Salary J)^ (^BUSINESS-UNIT.fiudgeT) 



^MPLOYEEl.Compefenre )— ►(^E MPLOYEE, DEVELOPS, PRODUCTl.Succe S^)— ►Cj ^PLOYEE, DEVELOPS, PRODUCT, FUNDS, BUSINESS-UNIT]. Keren ^^ 

^ I ^ J" I _ 

[EMPLOYEE], Sa/aryJ) 4 CJEMPLOYEE, DEVELOPS, PRODUCT, FUNDS, BUSINESS-UNIT]. BudgeT^ 



(; ^ODUCT, DEVELOPS, EMPLOYEE]. Competen ce^)— ► ^[PRODUCT].Sncces 7)——^<|P RODUCT, FUNDS, BUSINESS-UNIT]. gei/enu ?) 

^ I ^ ^ T ^ 

^~lPRODUCT, DEVELOPS, EMPLOYEE].5a/ary~^ < ([jPRODUCT, FUNDS, BUSINESS-UN]T].Budg^^ 



<JS USINESS-UNIT, FUNDS, PRODUCT, DEVELOPS, EMPLOYEE]. Compefe nce>——^<[[B USINESS-UNIT, FUNDS, PRODUCT], SuccS ?;)——^( lBUSINESS-UNIT].«erenue ) 
C^[] jBUSINESS-UNIT, FUNDS, PRODUCT, DEVELOPS, EMPLOYEEI.Satary ]^ < ^ [BUSINESS-UNIT].Budgef^ 

(b) 



Figure 8: For the model in Figure 2(a), (a) the class dependency graph and (b) three 
simple abstract ground graphs for the Employee, Product, and Business- 
Unit perspectives. 



base item; otherwise, no method can produce a single consistent, propositional table from 
a relational database. For example, consider the construction of a table describing employ- 
ees with columns for their salary, the success of products they develop, and the revenue 
of the business units they operate under. This procedure requires a join across three re- 
lational variables ( [Employee]. S'a/ary, [Employee, Develops, Product]. Success, and 
[Employee, Develops, Product, Funds, Business-Unit]. iJeuenue) for every common 
base item instance, from Paul to Thomas. Therefore, given a perspective and the space of 
simple relational schemas and models, a class dependency graph is equivalent to a simple 
abstract ground graph. 

Definition 6.2 (Simple abstract ground graph) For simple relational model M = 
{S,V), perspective B £ £ UTZ, and hop threshold h e N*^, the simple abstract ground graph 
AGGji^^^ is the directed acyclic graph {V, E) that abstracts the dependencies V among 
simple relational variables. The set of nodes is the union of all simple relational variables 
[[B,...,Ij\.V \ B ^ ■■■ ^ Ij A length{[B, . . . , Ij]) < h + l} , and the set of ed ges connects 
simple relational variables {[B,..., Ik\-Vi [B,. .., Ij]-V2 \ [Ij,-- - , h]-Vi [-/j]-V2 eV A 
[B,...,Ik] e extend{[B,...,I,],[Ij,...,h]) A [B, . . . , Ik]-Vi,[B, . . . , Ij].V2 G V}. 

Simple abstract ground graphs only include nodes for simple relational variables and 



necessarily exclude intersection variables. Lemma 4.1 only applies to pairs of simple rela- 
tional paths if the schema has cycles, which is not the case for simple relational schemas. 
As a result, for all possible perspectives of a given schema and model, all simple abstract 



26 



Independence in Models of Relational Data 



ground graphs consist of the same number of nodes and edges when the hop threshold is 
sufficiently large and with the path designator on relational variables redefined from the 
given perspective. Figure |8(b)| shows three simple abstract ground graphs from unique 



perspectives for the model in Figure |2(a) As noted above, simple abstract ground graphs 
are qualitatively the same as the class dependency graph, but this representation enables 
answering relational ^-separation queries, which requires a common perspective. 



Remark 6.1 The nai've approach to relational d-separation derives conditional indepen- 
dence facts from simple abstract ground graphs (Definition 6.2). The sound and complete 
approach described in this paper applies (f-separation, augmenting the variable sets with 
their intersection variables, to "regular" abstract ground graphs, as described by Defini- 



tion 5.2 Clearly, if (i-separation on a simple abstract ground graph claims that X is 
d-separated from Y given Z, then d-separation on the regular abstract ground graph yields 
the same conclusion if and only if there are no d-connecting paths on the regular abstract 
ground graph. Either X and Y can be d-separated by a set of simple relational variables Z, 
or they require non-simple relational variables — those involving relational paths with cycles 
or intersection variables |£] □ 



To evaluate the necessity of regular abstract ground graphs (i.e., the additional paths in- 
volving non-simple relational variables and intersection variables) , we devised an experiment 
to determine the frequency of equivalence of d-separation on simple and regular abstract 
ground graphs. The two approaches are only equivalent if a minimal separating set consists 
entirely of simple relational variables. Thus, for a given pair of relational variables X and 
y, we test the following: 

1. Is either X or Y & non-simple relational variable? 

2. Are X and Y marginally independent? 

3. Does a minimal separating set Z d-separate X and y, where Z consists solely of 
simple relational variables? 

4. Is there no separating set Z that d-separates X and Yl 

If the answer to (1) is yes, then the naive approach cannot apply since either X or y 
is not defined on the simple abstract ground graph. If the answer to (2) is yes, then there 
is equivalence; this is a trivial case because there are no d-connecting paths for Z = 0. If 
the answer to (3) is yes, then there is a minimal separating set Z consisting of only simple 
relational variables. In this case, the simple abstract ground graph is sufficient, and we 
have equivalence. If the answer to (4) is yes, then no separating set Z, simple or otherwise, 
renders X and Y conditionally independent. 

We randomly generated a schema and model for 100 trials for each setting using the 
following parameters: 

8. The theoretical conditions under which equivalence occurs are sufficiently complex that they provide 
little utility as they essentially require reconstructing the regular abstract ground graph and checking a 
potentially exponential number of dependency paths. 



27 



Maier, Marazopoulou, and Jensen 



• Number of entity classes, ranging from 1 to 4. 

• Number of relationship classes fixed to one less than the number of entities, allowing 
only simple relational schemas. Relationship cardinalities are randomly selected. 

• Number of attributes for each entity and relationship class randomly drawn from a 
shifted Poisson distribution with mean A = 1.0 (~ Pois(l.O) + 1). 

• Number of dependencies in the model, ranging from 1 to 10. 

Then, for all pairs of relational variables with a common perspective for hop threshold 
/i = 4, we ran the aforementioned tests against the abstract ground graph with hop threshold 
h = 8. 

This procedure generated a total of almost 3.6 million pairs of relational variables to test. 
Approximately 56% included a non-simple relational variable; the nai've approach cannot 
be used to derive a conditional independence statement in these cases, requiring the full 
abstract ground graph in order to represent these variables. Of the remaining 44% (roughly 
1.6 million), 82% were marginally independent, and 9% were not conditionally independent 
given any conditioning set Z. Then, of the remaining 9% (roughly 145,000), we can test 
the frequency of equivalence for conditional independence facts with non-empty separating 
sets — the percentage of cases for which only simple relational variables are required in a 
minimal separating set Z. 

Figure [9] shows this frequency for schemas of increasing numbers of entity classes (1-4) for 
varying numbers of dependencies in the causal model (1-10). Since relational schemas with 
a single entity class and no relationships describe propositional data, the simple abstract 
ground graph is equivalent to the full abstract ground graph, which is also equivalent to the 
model itself. In this case, the nai've approach is always equivalent because it is exactly d- 
separation on Bayesian networks. For truly relational schemas (with more than one entity 
class and at least one relationship class), the frequency of equivalence decreases as the 
number of dependencies in the model increases. Additionally, the frequency of equivalence 
decreases more as the number of entities in the schema increases. For example, the frequency 
of equivalence for nine dependencies is 60.3% for two entities, 51.2% for three entities, and 
43.2% for four entities. 

There are several factors that influence the frequency of equivalence. We learned a linear 
regression model with interaction terms and indicator variables for the number of entities, 
dependencies, and many cardinalities on relationships. In predicting the raw number of 
non-equivalent conditional independencies (those requiring non-simple relational variables 
in the separating set), the strongest predictor is the interaction between the number of 
many cardinalities and the number of dependencies. This suggests that the more depen- 
dencies in the model and the more many cardinalities, the more non-equivalent conditional 
independencies will occur. Thus, the frequency of equivalence illustrated in Figure [9] is 
only so high because the experiment averages across randomly chosen schemas with ran- 
domly selected relationship cardinalities. When there are few many cardinalities, the model 
can be more closely approximated by a Bayesian network, for which we expect the naive 
approach to be more equivalent. In contrast, in predicting the raw number of equivalent 
conditional independencies (those with only simple relational variables in the separating 



28 



Independence in Models of Relational Data 




Figure 9: Of all generated relational (^-separation queries, 56% are not representable with 
the naive approach. Of the 44% that are representable (involving only simple 
relational variables), 82% are marginally independent and 9% are dependent. 
Pairs of relational variables in the remaining 9% are conditionally independent 
given a non-empty separating set [X _LL y | Z, where Z 7^ 0). We test whether the 
conditioning set consists solely of simple relational variables. If so, then the naive 
approach to relational d-separation is equivalent to d-separation on fully specified 
abstract ground graphs. This graph plots to the frequency of equivalence across 
schcmas with increasing numbers of entity classes (1-4) for varying numbers of 
dependencies (1-10). For schemas with more than one entity class, the frequency 
of equivalence decreases as the number of dependencies increases. Shown with 
95% confidence intervals. 



set), the strongest predictors are non-interaction terms: the number of entities (positive), 
the number of dependencies (positive), and the number of MANY cardinalities (negative). 
Conditional independencies occur more frequently when there are more entities and de- 
pendencies; however, non-simple relational variables occur more frequently when there are 
more MANY cardinalities. 

This experiment suggests that applying d-separation directly to a relational model will 
frequently derive incorrect conditional independence facts. Additionally, there is a large 
class of conditional independence queries involving non-simple parent or child variables 
for which such an approach is undefined. Together, this indicates that fully specifying 



29 



Maier, Marazopoulou, and Jensen 



abstract ground graphs and applying d-separation augmented with intersection variables 
(as described in Section [s]) is critical for accurately deriving most conditional independence 
facts from relational models. 



7. Experiments 



To complement the theoretical results, we present a series of three experiments on synthetic 
data. The primary goal of these empirical results is to demonstrate the feasibility of applying 
relational d-separation in practice. The experiment in Section 7.1 describes the factors 



that influence the size of abstract ground graphs and thus the computational complexity 
of relational ^-separation. The experiment in Section 7.2 evaluates the growth rate of 



separating sets produced by relational d-separation as abstract ground graphs become large. 
The results indicate that minimal separating sets grow more slowly than abstract ground 
graphs. The experiment in Section 7.3 tests how the expectations of the relational d- 
separation theory match statistical conclusions on simulated data. As expected from the 



proofs of correctness in Section 5.2 the results indicate a close match, aside from Type I 
errors and certain biases of conventional statistical tests on relational data. 



7.1 Abstract Ground Graph Size 

Abstract ground graphs (AGGs) explicitly represent the intersection among relational vari- 
ables and extend the canonically specified dependencies of relational models. Consequently, 
it is important to quantify the size of an AGG and determine which factors influence its 
size. We randomly generated relational schemas and models for 1,000 trials of each setting 
using the following parameters: 



• Number of entity classes, ranging from 1 to 4. 

• Number of relationship classes, ranging from to 4. The schema is guaranteed to 
be fully connected (e.g., schemas with three entity classes require at least two rela- 
tionships) and includes at most a single relationship between a pair of entities (e.g., 
schemas with two entity classes must have exactly one relationship class). Relation- 
ship cardinalities are selected uniformly at random. 

• Number of attributes for each entity and relationship class randomly drawn from a 
shifted Poisson distribution with mean A = 1.0 (~ Pois{1.0) + 1). 

• Number of dependencies in the model, ranging from 1 to 15. 



This procedure generated a total of 450,000 abstract ground graphs across all perspec- 
tives, including all entity and relationship classes for each experimental combination. We 
measure size as the number of nodes and edges appearing in a given abstract ground graph. 

depicts how AGG size varies with respect to the number of many cardi- 



Figure 10(a^ 



nalities in the schema (fixed for models with 10 dependencies), and Figure 10(b) shows how 
it varies with respect to the number of dependencies in the model. Note that for a single 
entity, abstract ground graphs are equivalent to Bayesian networks. In order to determine 
the most infiuential factors of AGG size, we ran log-linear regression using independent 



30 



Independence in Models of Relational Data 



# Nodes (ent. persp.) 

# Edges (ent. persp.) 

# Nodes (rel. persp.) 

# Edges (rel. persp.) 




2 4 6 8 

Number of many cardinalities 

(a) 



N CD 

o 

O O 

< g 



2 4 6 8 10 12 14 

Number of dependencies 

(b) 



Figure 10: AGG size variation as (a) the number of many cardinalities in the schema in- 
creases (dependencies fixed at 10) and (b) the number of dependencies increases. 
Shown with 95% confidence intervals. 



variables that describe only the schema and model. Detailed results are provided in Ap- 
pendix [C} This analysis indicates: (1) As the number of entities, relationships, attributes, 
and MANY cardinalities increases, the AGG grows at an exponential rate with respect to 
both nodes and edges. (2) As the number of dependencies in the model increases, the 
number of edges increases linearly, but the number of nodes is invariant. (3) AGGs with re- 
lationship perspectives are larger than entity perspectives because more relational variables 
can be defined. 

7.2 Minimal Separating Set Size 

Because abstract ground graphs can become large, one might expect that separating set^ 
would also grow to impractical sizes. Fortunately, relational d-separation produces minimal 
separating sets that are empirically observed to be small. We ran 1,000 trials of each setting 
using the following parameters: 

• Number of entity classes, ranging from 1 to 4. 

• Number of relationship classes fixed to one less than the number of entities. Relation- 
ship cardinalities are selected uniformly at random. 

• Total number of attributes across entity and relationship classes fixed to 10. 

• Number of dependencies in the model, ranging from 1 to 10. 

9. If X and Y are rf-separated given Z, then Z is a separating set for X and Y. A separating set Z is 
minimal if there is no proper subset of Z that is also a separating set. 



31 



Maier, Marazopoulou, and Jensen 



_ o 



CD 

E 
E 

O 


N 



CD ^ 
(/) 

i ^ 

CD 
Q. 




Dependencies: 5 
Entities: 1 



10 1 5 10 1 5 10 1 5 10 
2 3 4 



Figure 11: Minimal separating sets have reasonable sizes, growing only with respect to 
model density. 



For each relational model, we identified one minimal separating set for up to 100 randomly 
chosen pairs of conditionally independent relational variables. This procedure generated 
almost 2.5 million pairs of variables. 

To identify a minimal separating set between relational variables X and Y, we modified 



Algorithm 4 devised by Tian et al. (1998) by starting with all parents of X and Y, the 
variables augmented with the intersection variables they subsume in the abstract ground 
graph. While the discovered separating sets are minimal, they are not necessarily of min- 



imum size. Figure 11 shows the frequency of separating set size as both the number of 
entities and dependencies vary. In summation, roughly 83% are marginal independencies 
(having empty separating sets), 13% have separating sets of size 1, and less than 0.1% 
have separating sets with more than 5 variables. The experimental results indicate that 
separating set size is strongly influenced by model density, primarily because the number 
of potential d-connecting paths increases as the number of dependencies increases. 

7.3 Empirical Validity 

As a practical demonstration, we examined how the expectations of the relational d- 
separation theory match the results of statistical tests on actual data. We parameterized 
relational models in a way that produced linear effects with equal contribution from each 
direct cause and standard methods for aggregating terminal sets of relational variables. We 
used simple additive linear equations with independent, normally distributed error, and 
the average aggregate for terminal sets of relational variables. If Y has no parents, then 

Y = e ~ A^fO, 1), and Y = > ; — —avq(X) + .le otherwise. Relational skeletons 

X£par(Y) 

were constructed such that relationship instances were produced via a latent homophily 



process, similar to the method used by Shalizi and Thomas (2011). Each entity instance 



received a single latent variable, marginally independent from all other variables. The prob- 

ability of any relationship instance was drawn from -j , the inverse logistic function, 

1 + e 

where d = \Lei — Le2\, the difference between the latent variables on the two entities, and 
a = 10, set as the decay parameter. We also scaled the probabilities in order to produce 



32 



Independence in Models of Relational Data 



CO 
LLj 

a: 

LJJ 

O 

Q 
LU 



< 

LU 
CO 

Q 



O in 

£ § 

c 


to LO 
. o 

< 




"I 1 1 r 

1 entity 2 entities 3 entities 4 entities 



o 
'c 

W 

d. 
o 




r 

1 entity 2 entities 



r 

3 entities 4 entities 



CO 
UJ 

a: 

LU 

O 

Q 
LU 
I- 

o 

LU 



o 
o 

Q 



O 


O m 

£ § 

C 
0) 

W m 
. o 

1^ 



1 1 1 1 

1 entity 2 entities 3 entities 4 entities 



CO 

o 
'c 

W 

Q. 
P 



1 entity 



1 1 1 

2 entities 3 entities 4 entities 



Figure 12: The relational d-separation theory closely matches the results of statistical tests 
on actual data. 



an expected degree of 5 for each entity instance when the cardinality of the relationship is 
MANY. Since the latent variables arc marginally independent of all others, they are safely 
omitted from abstract ground graphs; their sole purpose was to generate relational skele- 
tons, providing a greater probability of non-empty intersection variables as opposed to a 
random underlying link structure. 

For 100 trials, we randomly generated a relational schema and model using the following 
parameter settings: 

• Number of entity classes, ranging from 1 to 4. 

• Number of relationship classes fixed to one less than the number of entities. Relation- 
ship cardinalities are selected uniformly at random. 

• Number of attributes for each entity and relationship class randomly drawn from a 
shifted Poisson distribution with mean A = 1.0 (~ Pozs(l.O) + 1). 



33 



Maier, Marazopoulou, and Jensen 



• Number of dependencies in the model fixed to 10. 

We then tested up to 100 true and false relational d-separation queries across 100 skeletons 
(i.e., instantiated relational databases) with 1,000 instances of each entity class. To test 
for conditional independence between X and Y given Z {X _LL y | Z ?), we tested the 
t-statistic for the coefficient of avg{X) in the equation y = /3o + (3iavg{X) + j3iavg{Zi). 

The queries were restricted such that X and Y are single relational variables, and the 
relational path for y is a singleton. These queries correspond to testing potential canonical 
(direct causal) dependencies. 

For each query, we recorded the average strength of effect (as measured by the squared 
partial correlation coefficient — the proportion of remaining variance of Y explained by X 
after conditioning on Z) and the proportion of trials for which each query was significant 
(a = 0.01 adjusted with Bonferroni correction with the number of queries per trial). Fig- 
ure 12 shows the distribution of the average strength of effect and proportion of significant 
trials across both true and false queries for varying numbers of entities. The graph uses a 
standard box-and- whisker plot with values greater or less than 1.5 times the inner quartile 
range — the difference between the upper and lower quartiles — marked as outliers. 

In the vast majority of cases, relational d-separation is consistent with tests on actual 
data. For approximately 23,000 true queries, 14.9% are significant in more than one trial, 
but only 2.2% have an average effect size greater than 0.01. Aside from Type I error, a 
small number of cases exhibit an interaction between aggregation and relational structure 
(i.e., degree or the cardinality of terminal sets of relational variables), and even small effects 
produce significant results with a large sample size. This interaction violates the identically 
distributed assumption of data instances, which produces a biased estimate of effect size 
for simple linear regression. Linear regression does not account for these interaction ef- 
fects, suggesting the need for more accurate statistical tests of conditional independence for 
relational data. 



8. Discussion 

In this paper, we extend the theory of d-separation to graphical models of relational data. 
We present the abstract ground graph, a new representation that is {B, /i)-reachably sound 
and complete in its abstraction of dependencies across all possible ground graphs of a given 
relational model. We formally define relational ^-separation and offer a sound, complete, 
and computationally efficient approach to deriving conditional independence facts from re- 
lational models by exploiting their abstract ground graphs. We also show that relational 
d-separation is equivalent to the Markov condition for relational models. We provide an 
empirical analysis of relational d-separation on synthetic data, demonstrating a close cor- 
respondence between the theory and statistical results in practice. Finally, we evaluate 
how frequently the complexity of abstract ground graphs proves necessary for accurately 
deriving conditional independence facts. 

The results of this paper imply potential flaws in the design and analysis of some real- 
world studies. If researchers of social or economic systems choose inappropriate data and 
model representations, then their analyses may omit important classes of dependencies. 
Specifically, our theory implies that choosing a propositional representation from an in- 



34 



Independence in Models of Relational Data 




Figure 13: (a) Two models of a bivariate relational domain with opposite directions of 
causality for a single dependency (relationship class omitted for simplicity) ; (b) 
a single dependency implies additional dependencies among arbitrary relational 
variables, shown here in a fragment of the abstract ground graph for B's per- 
spective; (c) an example relational skeleton; and (d) the ground graphs resulting 
from applying the relational model to the skeleton. 



herently relational domain may lead to serious errors. An abstract ground graph from a 
given perspective defines the exact set of variables that must be included in any propo- 
sitionalization. The absence of any relational variable (including intersection variables) 
may unnecessarily violate causal sufficiency, which could result in the inference of a causal 
dependency where conditional independence was not detected. Our work indicates that 
researchers should carefully consider how to represent their domains in order to accurately 
reason about conditional independence. 

The abstract ground graph representation also presents an opportunity to derive new 
edge orientation rules for algorithms that learn the structure of relational models, such as 



RPC (Maier et al. , 2010). There are unique orientations of edges that are consistent with 



a given pattern of association that can only be recognized in an abstract ground graph. 
For example, in contrast to bivariate IID data, it is simple to establish the direction of 
causality for bivariate relational data. Consider the two bivariate, two-entity relational 
models depicted in Figure [l3|| a). The first model implies that values of X on A entities are 
caused by the values of Y on related B entities. The second model implies the opposite, 
that values of y on i? entities are caused by the values of X on related A entities. For 
simplicity, we show the relationship class only as a dashed line between entity classes and 
omit it from relational paths. 

Figure 13 'b) illustrates a fragment of the abstract ground graph (for hop threshold /i=4) 
that each of the two relational models implies. As expected, the directions of the edges in 
the two abstract ground graphs are counterposed. Both models produce observable statisti- 
cal dependencies for relational variable pairs ([i?].y, [-B, and {\B,A\.X, \B,A,B].Y). 
However, the relational variables [-B].y and [B,A,B].Y have different observable statisti- 



35 



Maier, Marazopoulou, and Jensen 



cal dependencies: In the first model, they are marginally independent and conditionally 
dependent given [i?,74].X, and in the second model, they are marginally dependent and 
conditionally independent given [5,^4]. X. As a result, we can uniquely determine the di- 
rection of causality of the single dependence by exploiting relational structure. (There is 
symmetric reasoning for relational variables from A's perspective, and this result is also 
applicable to ONE-to-MANY data.) 

To illustrate this fact more concretely, consider the small relational skeleton shown in 



Figure 13 c) and the ground graphs applied to this skeleton in Figure 13 ^d). In the first 



ground graph, we have yi _LL y2 and yi -ijL y2 \xi, but in the second ground graph, we have 
yi J/ 2/2 and yi -LL 2/2 \xi. These opposing conditional independence relations uniquely 
determine the correct causal model. 

Deriving and formalizing the implications of relational d-separation is a main direction 
of future research. Additionally, our experiments suggest that more accurate tests of con- 
ditional independence for relational data need to be developed, specifically tests that can 
address the interaction between relational structure and aggregation across terminal sets of 
relational variables. This work has also focused solely on relational models of attributes; 
future work should consider models of relationship and entity existence to fully characterize 
generative models of relational structure. Finally, the theory could also be extended to 
incorporate functional or deterministic dependencies, as Z)-separation extends d-separation 
for Bayesian networks. 



Acknowledgments 

The authors wish to thank Cindy Loiselle for her editing expertise. This effort is sup- 
ported by the Intelligence Advanced Research Project Agency (lARPA) via Department of 
Interior National Business Center Contract number D11PC20152, Air Force Research Lab 
under agreement number FA8750-09-2-0187 and Science Applications International Corpo- 
ration (SAIC) and DARPA under contract number P010089628. The U.S. Government is 
authorized to reproduce and distribute reprints for governmental purposes notwithstanding 
any copyright notation hereon. The views and conclusions contained herein are those of 
the authors and should not be interpreted as necessarily representing the official policies or 
endorsements either expressed or implied, of lARPA, DoI/NBC, AFRL, SAIC, DARPA or 
the U.S. Government. Katerina Marazopoulou received scholarship support from the Greek 
State Scholarships Foundation. 



Appendix A. 

In this appendix, we provide detailed proofs for all previous lemmas, theorems, and corol- 
laries. 

Lemma 14.11 For any schema S and any two relational paths Pi = [Ii, . . . , Im, ■ ■ ■ , Ik] dnd 

P2 = [h, . . . , In, ■ ■ ■ , Ik] with Im In, there exists a skeleton a such that Pi\ij^ Ci P2\ii 7^ 
for some ii G cy{Ii)- 

Proof. Proof by construction. Let S be an arbitrary schema with two arbitrary relational 
paths Pi = [/i, . . . , /m, . . . , /fc] and P2 = [Ii, 4] where / In- We wiU con- 



36 



Independence in Models of Relational Data 




Figure 14: Schematic of two relational paths Pi and P2 for which Lemma 4.1 guarantees 
that some skeleton a yields a nonempty intersection of their terminal sets. The 
example depicts a possible constructed skeleton based on the procedure used in 
the proof of Lemma |4.1[ 



struct a skeleton a such that the terminal sets for item ii £ along Pi and P2 have a 

nonempty intersection, that is, an item G Pi\i^(^P2\ii / O (roughly depicted in Figure 14). 
We use the following procedure to build a: 

1. Simultaneously traverse Pi and P2 from Ii until the paths diverge. For each entity 
class E £ £ reached, add a unique entity instance e to cr{E). 

2. Simultaneously traverse Pi and P2 backwards from until the paths diverge. For 
each entity class E £ £ reached, add a unique entity instance e to cr{E). 

3. For the divergent subpaths of both Pi and P2, add unique entity instances for each 
entity class E £ £. 

4. Repeat 1-3 for relationship classes. For each R £ TZ reached, add a unique relationship 
instance r connecting the entity instances from classes on Pi and P2, and add unique 
entity instances for classes E £ R not appearing on Pi and P2. 

This process constructs an admissible skeleton — all instances are unique and this process 
assumes no cardinality constraints aside from those required by Definition 4.3 By construc- 
tion, there exists an item ii £ cr{Ii) such that Pi|ji n P2\ii = {ik} 7^ 0- ■ 



Lemma 15.11 For any two relational paths Porig = [Ii, - ■ ■ , Ij\ o,nd Pext = [-(;■>••• > Ik] with 
P = extend{Porig, Pext) , VP £ P there exists a relational skeleton a such that 3ii £ 
such that 3ik £ P|j^ and 3ij £ Porig\ii such that i^ £ Pext\ij- 

Proof. Let Porig = Vii ■ ■ ■ i^j] ^-ncl P^xt = [Ijt ■ ■ ^Ik\ be two arbitrary relational paths, 
and let P = extend{Porig , Pext) ■ Let c £ pivots{r ever se{Porig), Pext) be an arbitrary pivot 
such that P = concat{Porig[0 '■ length{Porig) — c + l],Pe2^f[c : length{Pext)]) ■ Since P E P, 
we know that P is a valid relational path, and we can ignore pivots that produce invalid 
paths. There are two subcases: 



37 



Maier, Marazopoulou, and Jensen 




Figure 15: Example construction of a relational skeleton for two relational paths Porig = 
[Ii, . . . , Im ■ ■ ■ , Ij] and Pext = [Ij, ■ ■ ■ , I'm ■ ■ ■ , Ik], where item class Im is repeated 
between Im and Ij. This construction is used within the proof of Lemma 5.1 



(a) c = 1 and P = [Ii, . . . , Ij, . . . , I/.]- This subcase holds generally for any skeleton. 
Proof by contradiction. Let a be an arbitrary skeleton, choose ii € arbitrarily, and 
choose ik £ P\i-^ arbitrarily. Assume for contradiction that there is no ij in the terminal set 
Poriglh such that ifc would be in the terminal set Pext\ij, that is, \/ij £ Poriglh ik ^ Pext\ij- 
Since P = [/i, . . . , /j, . . . , I^], we know that ik is reached by traversing a from ii via some 
ij to ik- But the traversal from ii to ij implies that ij G [/i, . . . = Porig\h-, and the 
traversal from ij to ik implies that ik G [Ij, . . . ,Ik]\ij = Pext\ij- So, there must exist an 

ij G Porig\i-i SUch that ik G Pext\ij' 

(b) c > 1 and P = [Ii, . . . , Im, . . . , I^]. Proof by construction. We build a rela- 



tional skeleton a following the same procedure as outlined in the proof of Lemma 4.1 
Add instances to a for every item class that appears on Porig and Pext- Since P = 
[Ii, . - - , Imt - - - , Ik\, W6 know that ik is reached by traversing a from ii via some im to 
ik- By case (a), 3im G [/i, . . . , Im\\ii such that ik G [Im, • • • ) -^fc]|im- We then must show that 
there exists an ij G [Im, • • • , Ij\\im with im G [Ij, . . . , /m]|ij - But, constructing the skeleton 
with unique item instances for every appearance of an item class on the relational paths 
provides this and does not violate and cardinality constraints. If any item class appears 
more than once, then the bridge burning semantics are induced. However, adding an ad- 
ditional item instance for every reappearance of item classes enables the traversal from ij 
to im and vice versa. An example of this construction is displayed in Figure [15} This is 
also a valid relational skeleton because Porig and Pext are valid relational paths, and by 
definition the cardinality constraints of the schema must permit multiple (many) instances 
in the skeleton of any repeated item class. By this procedure, we show that there exists a 
skeleton a such that there exists an ij G Porig\ix such that ik G Pext\ij- * 



Lemma 15.21 For any relational skeleton a and two relational paths Porig = [Ii, - - - , Ij] and 
Pext = [Ij,---,Ik] with P = extend{Porig,Pext), Vil G cr(/i) Mij G Porig\i^ G Pextlij if 
yP e F ik ^ -P|ji, then 3Porig such that Poriglii n Poriglii / and ik G P'[i^ for some 
P' G extend{P'^^.g,Pext)- 

Proof. Proof by construction. Let a be an arbitrary skeleton, and let ii G (y{Ii), ij G 
Porig\ii-: and ik G Pext\ij be arbitrary instances such that ik ^ P[i^ for any P G P. 

Since ij G Porig\ix and ik G Pext\ij-, but ik ^ P\ii-, there exists no pivot that yields a 
common subsequence in Porig and Pext that produces a path in extend that can reach ik- 



38 



Independence in Models of Relational Data 




Figure 16: Schematic of the relational paths expected in Lemma 5.2 

Pext 



reachable via extend{P^ 

\Il , . . . , Im 1 ■ ■ ■ 1 1ni ■ ■ 



orig: - 



then there must exist a P' 



If item ik is un- 
of the form 



Let the first divergent item class along the reverse of Porig be /; and along Pext be I„. The 
two paths must not only diverge, but they also necessarily reconverge at least once. If Porig 
and Pext do not reconverge, then there are no reoccurrences of an item class along any 
P E P that would restrict the inclusion of ik in some terminal set P\i^. The sole reason 



that ik ^ P\i^ for any P G P is by Definition 4.4, through the bridge burning semantics of 
terminal sets. 

Without loss of generality, assume Pong and Pext reconverge once, at item class Im- 

So, Porig = [h, . . . ,Im, ■ ■ ■ ,Il, ■ ■ ■ ,Ij] and Pext = [Ij , ■ ■ ■ , In, ■ ■ ■ , Im, ■ ■ ■ , 4] with Ii / /„, 

as depicted in Figure 



path because [Ii , , 



16 



Let Porig = [h, ■ ■ ■ , Im, ■ ■ ■ , In, ■ ■ ■ , Ij], which is a valid relational 
is a subpath of Porig and [Im, ■ ■ ■ , In, ■ ■ ■ , Ij] is a subpath of Pext- 
By construction, ij £ Porigk n Po^gk ■ P' = [h, ■ ■ ■ , Im, ■ ■ ■ , h] & extend{P^^ig, Pext) 
with pivot at Im, then ik G P'ln- • 



Theorem 15.21 For any acyclic relational model Ai, perspective B G SUTZ, and hop thresh- 
old h G N*^, the abstract ground graph AGGMBh is {B, h)-reachably sound and complete for 
any ground graph GGjUa for all skeletons a. 

Proof. Let M = {S,T>) be an arbitrary acyclic relational model, let P G U 7^ be an 
arbitrary perspective, and let /i G N*^ be an arbitrary hop threshold. We first show that 
AGGMBh is (P, /i)-reachably sound, and then prove that AGGj^Bh is (P, /i)-reachably 
complete. Figure [7] represents the general connection between abstract ground graphs and 
ground graphs. 

(1) To prove that AGGmbh is (P, /i)-reachably sound, we must show that for every edge 
Pfc-^i — ^ Pj-^2 in AGGj^Bh, there exists a corresponding edge ik-Vi ^j-^2 in the ground 
graph GGmct for some skeleton a where ik G Pk\b and ij G Pj\b for some h G cr{B). There 
are three subcases, one for each type of edge in an abstract ground graph: 

(a) Let [P, . . . , Ik]-yi — ^ [B, ■ ■ ■ , Ij]-V2 S RVE be an arbitrary edge in AGGmbh be- 
tween a pair of relational variables. Assume for contradiction that there exists no edge 
ik-Vi — )• ij-y2 connecting variable instances in any ground graph, that is, V6 G o"(P) yik-Vi G 
[P, . . . , Vij.V2 G [P, . . . , Ij].V2\b ife-Vi — )• ij-V2 ^ GGmct for any skeleton a. By Def- 

inition 5.2, if the abstract ground graph has edge [P, . . . ,Ik]-Vi — >■ [P, . . . ,Ij]-V2 G RVE, 



then the model must have dependency [Ij, . . . ,Ik]-Vi — ?• [-(?■] -1^ G f and the relational 



39 



Maier, Marazopoulou, and Jensen 



path [B, . . . ,Ik] G extend{[B , . . . , Ij],[Ij, . . . , 1^])- So, by Definition 4.10 for ground graphs, 
there is an edge to every ij.V2 from every ifc.Vi in the terminal set for ij along [Ij, . . . , Ik]-Vi, 
that is, Vij G cr(/j) Vifc G [Ij, ■ ■ ■ ,/fc]|j. ik-Vi — ^ S GGm<t for any skeleton o". Since 



[B, . . . , Jfc] G extend{[B , . . . , Ij],[Ij, . . . , I^]), by Lemma 5.1 we know that there exists a 
skeleton a such that G [5, . . . , 3ij G . . . , such that ik G [Ij, . . . , 
for some b G cr{B). So, 36 G G [B,...,Ik]\b ^ij G [S, . . . , such that 

ik-^i ~^ G GG^fj for some skeleton a, which contradicts the assumption. 

(b) Let Pi.VinP2-^i [B,. ■ ■ , Ij]-V2 S IVE be an arbitrary edge in AGGMBh between 
an intersection variable and a relational variable, where Pi= [B, . . . , Im, ■ ■ ■ , Ik] and P2 = 



[B, . . . , /„, . . . , /fc] with Jm 7^ /„. By Lemma 4.1 there exists a skeleton a such that Pi\h H 



-P2I6 7^ for some b G Let G -Pilfc n P2\b for such a 6 G for a. Assume 



for contradiction that Vij G [P, . . . , ifc.Vi — )• ij.V2 ^ GGmct- By Definition 5.2, if 
the abstract ground graph has edge Pi.Vi n P2-Vi — [P, . . . , /j].V2 G IVE, then either 
Pi.Fl ^ [B,..., Ij].V2 G PV'^ or P2.V1 ^[B,..., Ij].V2 G PFE^. Then, as shown in case 
(a), 3ij G [B, . . . , Ij]\h such that ifc.Vi — >• ij-V2 G GGj^^j, which contradicts the assumption. 

(c) Let [B,... Jk]-Vi Pi-V2 n P2-^2 e be an arbitrary edge in AGGmbh be- 

tween a relational variable and an intersection variable, where Pi = [B, . . . ,Im, ■ ■ ■ ,Ij] 
and P2 = [B, . . . ,In, . . . ,Ij] with 1^ 7^ /n- The proof follows case (b) to show that 
Vzfc G [B, . . . , IkWb^ij G P1.V2 n P2-V2\b such that ifc-Vi — )• ij-V2 G GGmct for some skeleton 
cr and 6 G for which G Pi];, H P2|fe. 

(2) To prove that the abstract ground graph AGGj^Bh is (P, /i)-reachably complete, 
we show that for every (P, /i)-reachable edge ik-Vi — )• in any ground graph GGmct 

for some skeleton a, there is a corresponding edge X ^ Y in AGGj^Bh- Specifically, the 
(P, /i)-reachable edge ik-Vi — ^j-^ yields two sets of relational variables for some b G cr{B), 
namely Pk-Vi = {Pk-Vi \ ik S Pk\b A length{Pk) < h + 1} and Pj.V2 = {Pj.T^2 | ij e 
Pj If, A length{Pj) < h+1} by Definition 5.4 Note that all relational variables in both Pk.Vi 



and Pj.V2 are nodes in AGGMBh, as are all pairwise intersection variables: VPfc.Vi, P^.Vi G 
Pk.Vi Pk.Vi n Pj^.Vi G AGGMBh and VP^.Vs, Pj.F2 G Pj.Va Pj.F2 n P!j.V2 G ^GGa^b^. 

Let o" be an arbitrary skeleton, and let ik-Vi — )• ij-V2 G GGm<t be an arbitrary {B,h)- 
reachable edge drawn from [Ij, . . . ,Ik].Vi — )• [/j].V2 G V where such that Pk-Vi 7^ and 
Pj.Va 0. We show that VP^.^i G Pk-Vi yPj.V2 G Pj.Va either (a) Pfc.Fi ^ P,-.y2 G 
AGGa^b;,, (b) Pk.Vi n P^.l/i ^ P,.y2 G vIGGa^b/., where P^.Vl G Pk-Vi, or (c) Pk-Vi ^ 
Pj.V2nP^.V2 G ^GG^^Bh, where P;.y2 G Pj-Va- Let P^.^i G Pk.Vi,P,-.y2 G Pj.Va be an 
arbitrary pair of relational variables drawn from the sets Pk.Vi and Pj.V2. 

(a) If Pk G extend{Pj, [Ij,. - ■ , h]), then Pk-Vi Pj-V2 G AGGmbk by Definition 



5.2 



(b) If Pk i extend{Pj, [Ij,..., h]), but 3P( G extend{Pj, [Ij,..., Ik]) such that P^.Vi G 
Pk.Vi, then P^.Fi ^ Pj.y2 G AGGa^b/., and Pfc.Vi n P^.Fi ^ P,.V2 G ^GG^^B/x by 



Definition [521 

(c) If VP G extend{Pj, [Ij, . . . ,4])P.Fi ^ Pk-Vi, then, by Lemma 
ij G P'Ab and Pfc G extend{P', [I„ . . . , 4]). So, P'.^s G Pj.Va, Pfc.Fi ^ K 



5.2 



and Pfc.Fi ^ Pj.V2 n P,-.y2 G AGGaib/, by Definition 



3Pj such that 
■2 G AGGa4B/i, 



5.2 



40 



Independence in Models of Relational Data 









A 


>■ < 


B 



(1) [B,A].X ^[B].Y 

(2) [B,A,B,A].X ^[B].Y 

(a) Relational model 



Bridge burning f X2 




No bridge burning 




0r - - - g 



(b) Relational skeleton 



(2) 



(1)&(2) 






t 



•Identical to 



(c) Ground graphs 

Figure 17: Example demonstrating that bridge burning semantics yields a more expressive 
class of models than semantics without bridge burning, (a) Relational model over 
a schema with two entity classes and two attributes with two possible relational 
dependencies (relationship class omitted for simplicity), (b) Simple relational 
skeleton with three A and three B instances, (c) Bridge burning semantics 
yields three possible ground graphs with combinations of dependencies (1) and 
(2) whereas no bridge burning yields two possible ground graphs. The bridge 
burning ground graphs subsume the no bridge burning ground graphs. 



Appendix B. 

In this appendix, we provide an example to show that the bridge burning semantics for 
terminal sets of relational paths yields a strictly more expressive class of relational models 
than semantics without bridge burning. The bridge burning semantics produces terminal 
sets that are necessarily subsets of terminal sets which would otherwise be produced without 
bridge burning. Paradoxically, this enables a superset of relational models. 



41 



Maier, Marazopoulou, and Jensen 



Recall the definition of a terminal set for a relational path: 



Definition 4.4 (Terminal set of a relational path) For any skeleton a^-ji and any ii G 



(t(/i), a terminal set P\i^ for relational path P = [/i, . . . can be defined inductively as 

[^i]|.i={n} 

[Ii, . . . = [J {ik I {{ik-i G 4 if h ^T^) V (ifc G 4_i if h G <?)) 

i.-ie[/i,...,4-i]lH A ife ^ [h, . . for j = 1 to k - 1} 

The final condition in the inductive definition {i^. ^ [-^i, • • • I ti for j = 1 to A: — 1) 
encodes bridge burning. The item is only added to the terminal set if it is not a member 
of the terminal set of any previous subpath. For example, let P be the relational path 
[Employee, Develops, Product, Develops, Employee]. This relational path produces 
terminal sets that include the employees that work on the same products: co-workers. 
Instantiating this path with the employee Quinn, P|Quinn, produces the terminal set {Paul, 
Roger, Sally}. Since Quinn G [Employee]! Quinn j the bridge burning semantics excludes 
Quinn from this set. This makes intuitive sense as well — Quinn should not be considered 
her own colleague. 

A relational model is simply a collection of relational dependencies. Each relational 
dependency is primarily described by the relational path of the parent relational variable 
(because, for canonically specified dependencies, the relational path of the child consists of 
a single item class) . The relational path specification is used in the construction of ground 
graphs, connecting variable instances that appear in the terminal sets of the parent and 
child relational variables. 

To characterize the expressiveness of relational models, we can inspect the space of 
representable ground graphs by choosing an arbitrary relational skeleton and a small set of 
relational dependencies. We show with a simple example that the bridge burning semantics 
for a model over a two-entity, bivariate schema yields more possible ground graphs than 
without bridge burning. (We omit the relationship class for simplicity.) In Figure 17(a)| 



we present such a model with two possible relational dependencies labeled (1) and (2). 



Figure 17(b) provides a very simple relational skeleton involving three A and three B 



instances (relationship instances are represented as dashed lines for simplicity). As shown 



in Figure 17(c) , the bridge burning semantics leads to three possible ground graphs, one 
for each combination of the dependencies (1), (2), and both (1) and (2) together. Without 
bridge burning, only two ground graphs are possible because dependency (2) completely 
subsumes dependency (1) with those semantics. 

This example generalizes to arbitrary dependencies. The terminal sets of relational 
paths that repeat item classes subsume subpaths under no bridge burning semantics. This 
leads to fewer possible relational models, which justifies our choice of semantics for terminal 
sets of relational paths. 

Appendix C. 



In this appendix, we provide additional details for the experiment in Section 7.1 The 



goal of this experiment is to determine which factors influence the size of abstract ground 



42 



Independence in Models of Relational Data 





1 .opih r*! PTit 


J. di Uldi 




# relationships 


1.15 


0.451 


0.153 


^ MANY cardinalities x isEntity=F 


1.11 


0.359 


0.128 


# entities 


-0.74 


0.352 


0.101 


# many cardinalities x isEntity=T 


0.91 


0.217 


0.069 


# many cardinalities x ^ relationships 


-0.33 


0.108 


0.022 


# attributes 


0.08 


0.024 


0.005 



Table 1: Estimated standardized cofficient, squared partial correlation coefficient, and 
squared semipartial correlation coefficient for each predictor in the log-linear model 
of the number of nodes in an abstract ground graph. 



graphs because the computational complexity of relational (i-separation depends on its 
size. Specifically, we show here the results of running log-linear regression to predict the 
size of abstract ground graphs for varying schemas and models. We standardized the input 



variables by dividing by two standard deviations, as recommended by Gelman (2008). Since 
the predictor for the number of dependencies is log-transformed, the standardization for that 
variable occurs after taking the logarithm. 

In predicting the (log of the) number of nodes, the following variables were significantly 
and substantively predictive (in order of decreasing predictive power): 

• Number of relationships (positive) 

• Interaction between many cardinalities and an indicator variable for whether the 
AGG is from an entity or relationship perspective (positive) 

• Number of entities (negative) 

• Interaction between the number of MANY cardinalities and relationships (negative) 

• Total number of attributes (positive) 

The fit for the nodes model has an = 0.814 for n = 449, 993, and Table [l] contains 
the standardized coefficients as well as the squared partial and semipartial correlation co- 
efficients for each predictor. In predicting the (log of the) number of edges, the following 
variables were significantly and substantively predictive (in order of decreasing predictive 
power): 

• Log of the number of dependencies (positive) 

• Number of relationships (positive) 

• Interaction between many cardinalities and an indicator variable for whether the 
AGG is from an entity or relationship perspective (positive) 

• Number of entities (negative) 

• Interaction between the number of many cardinalities and relationships (negative) 



43 



Maier, Marazopoulou, and Jensen 





1 .opih r*! PTit 


J. di UlCli 




log(# dependencies) 


0.41 


0.440 


0.165 


# relationships 


1.09 


0.395 


0.138 


^ MANY cardinalities x isEntity=F 


1.21 


0.356 


0.135 


# entities 


-0.78 


0.353 


0.115 


# many cardinalities x isEntity=T 


1.00 


0.231 


0.077 


# many cardinalities x # relationships 


-0.38 


0.127 


0.031 



Table 2: Estimated standardized cofficient, squared partial correlation coefficient, and 
squared semipartial correlation coefficient for each predictor in the log-linear model 
of the number of edges in an abstract ground graph. 

The fit for the edges model has an = 0.789 for n = 449,993, and Table [2] contains 
the standardized coefficients and the squared partial and semipartial correlation coefficients 
for each predictor. 

References 

Richard Barker. CASE Method: Entity Relationship Modeling. Addison- Wesley, Boston, 
MA, 1990. 

Wray L. Buntine. Operations for Learning with Graphical Models. Journal of Artificial 
Intelligence Research, 2:159-225, 1994. 

Nir Friedman. Inferring cellular networks using probabilistic graphical models. Science, 303 
(5659):799-805, 2004. 

Dan Geiger and Judea Pearl. On the logic of causal models. In Proceedings of the Fourth 
Annual Conference on Uncertainty in Artificial Intelligence, pages 136-147, 1988. 

Dan Geiger, Thomas Verma, and Judea Pearl. Identifying independence in Bayesian net- 
works. Networks, 20(5):507-534, 1990. 

Andrew Gelman. Scaling regression inputs by dividing by two standard deviations. Statistics 
in Medicine, 27(15) :2865~2873, 2008. 

Andrew Gelman and Jennifer Hill. Data Analysis Using Regression and Multi- 
level/Hierarchical Models. Cambridge University Press New York, 2007. 

Lise Getoor. Learning Statistical Models from Relational Data. Ph.D. thesis, Stanford 
University, 2001. 

Lise Getoor and Ben Taskar, editors. Introduction to Statistical Relational Learning. MIT 
Press, Cambridge, MA, 2007. 

Lise Getoor, Nir Friedman, Daphne Koller, Avi Pfeffer, and Ben Taskar. Probabilistic 
relational models. In Lise Getoor and Ben Taskar, editors. Introduction to Statistical 
Relational Learning, pages 129-174. MIT Press, Cambridge, MA, 2007. 



44 



Independence in Models of Relational Data 



W. R. Gilks, A. Thomas, and D. J. Spiegelhalter. A language and program for complex 
bayesian modeling. The Statistician, 43:169-177, 1994. 

David Heckerman, Chris Meek, and Daphne KoUer. Probabilistic entity-relationship mod- 
els, PRMs, and plate models. In Lise Getoor and Ben Taskar, editors, Introduction to 
Statistical Relational Learning, pages 201-238. MIT Press, Cambridge, MA, 2007. 

Michael G. Hudgens and M. Elizabeth Halloran. Toward causal inference with interference. 
Journal of the American Statistical Association, 103(482) :832-842, 2008. 

Stefan Kramer, Nada Lavrac, and Peter Flach. Propositionalization approaches to relational 
data mining. In Saso Dzeroski and Nada Lavrac, editors. Relational Data Mining, pages 
262-286. Springer- Verlag, New York, NY, 2001. 

Marc Maier, Brian Taylor, Hiiseyin Oktay, and David Jensen. Learning causal models of 
relational domains. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial 
Intelligence, 2010. 

Richard E. Neapolitan. Learning Bayesian Networks. Pearson Prentice Hall Upper Saddle 
River, NJ, 2004. 

Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 
New York, NY, 2000. 

Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. 
Morgan Kaufmann, San Pransico, CA, 1988. 

Richard Scheines. An introduction to causal inference. In Vaughan R. McKim and Steven P. 
Turner, editors. Causality in Crisis? Statistical Methods and the Search for Causal 
Knowledge in the Social Sciences, pages 185-199. University of Notre Dame Press, 1997. 

Eran Segal, Ben Taskar, Audrey Gasch, Nir Friedman, and Daphne Roller. Rich probabilis- 
tic models for gene expression. Bioinformatics, 17(suppl 1):S243-S252, 2001. 

Cosma R. Shalizi and Andrew C. Thomas. Homophily and contagion are generically con- 
founded in observational social network studies. Sociological Methods & Research, 40(2): 
211-239, 2011. 

Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction and Search. 
MIT Press, Cambridge, MA, 2nd edition, 2000. 

Ben Taskar, Eran Segal, and Daphne Koller. Probabilistic classification and clustering 
in relational data. In Proceedings of the Seventeenth International Joint Conference on 
Artificial Intelligence, pages 870-878, 2001. 

Eric J. Tchetgen Tchetgen and Tyler J. VanderWeele. On causal inference in the presence 
of interference. Statistical Methods in Medical Research, 21(l):55-75, 2012. 

Jin Tian, Azaria Paz, and Judea Pearl. Finding Minimal D-separators. Technical Report 
R-254, UCLA Computer Science Department, February 1998. 



45 



Maier, Marazopoulou, and Jensen 



loannis Tsamardinos, Laura E. Brown, and Constantin F. Aliferis. The max-min hill- 
climbing Bayesian network structure learning algorithm. Machine Learning, 65(l):31-78, 
October 2006. 

Thomas Vcrma and Judca Pearl. Causal networks: Semantics and expressiveness. In 
Proceedings of the Fourth Annual Conference on Uncertainty in Artificial Intelligence, 
pages 352-359, 1988. 



46